# nginx-logtail User Guide

## Overview

nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Components:

| Binary       | Runs on         | Role                                               |
|--------------|-----------------|----------------------------------------------------|
| `collector`  | each nginx host | Tails log files, aggregates in memory, serves gRPC |
| `aggregator` | central host    | Merges all collectors, serves unified gRPC         |
| `frontend`   | central host    | HTTP dashboard with drilldown UI                   |
| `cli`        | operator laptop | Shell queries against collector or aggregator      |

---

## nginx Configuration

Add the `logtail` log format to your `nginx.conf` and apply it to each `server` block:

```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```

The format is tab-separated with fixed field positions. Query strings are stripped from the URI
by the collector at ingest time — only the path is tracked.

---

## Building

```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```

Requires Go 1.21+. No CGO, no external runtime dependencies.

---

## Collector

Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time
windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).

### Flags

| Flag          | Default  | Description                                               |
|---------------|----------|-----------------------------------------------------------|
| `--listen`    | `:9090`  | gRPC listen address                                       |
| `--logs`      | —        | Comma-separated log file paths or glob patterns           |
| `--logs-file` | —        | File containing one log path/glob per line                |
| `--source`    | hostname | Name for this collector in query responses                |
| `--v4prefix`  | `24`     | IPv4 prefix length for client bucketing (e.g. `24` = /24) |
| `--v6prefix`  | `48`     | IPv6 prefix length for client bucketing                   |

At least one of `--logs` or `--logs-file` is required.

### Examples

```bash
# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```

### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```

### Log rotation

The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.

### Memory usage

The collector is designed to stay well under 1 GB:

| Structure                 | Max entries | Approx size |
|---------------------------|-------------|-------------|
| Live map (current minute) | 100 000     | ~19 MB      |
| Fine ring (60 × 1-min)    | 60 × 50 000 | ~558 MB     |
| Coarse ring (288 × 5-min) | 288 × 5 000 | ~268 MB     |
| **Total**                 |             | **~845 MB** |

When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that
minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.

### Time windows

Data is served from two tiered ring buffers:

| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min  | fine        | 1 minute   |
| 5 min  | fine        | 1 minute   |
| 15 min | fine        | 1 minute   |
| 60 min | fine        | 1 minute   |
| 6 h    | coarse      | 5 minutes  |
| 24 h   | coarse      | 5 minutes  |

History is lost on restart — the collector resumes tailing immediately but all ring buffers start
empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail collector
After=network.target

[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

---

## Aggregator

Runs on a central machine. Subscribes to the `StreamSnapshots` push stream from every configured
collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
interface as the collector. The frontend and CLI query the aggregator exactly as they would query
a single collector.

### Flags

| Flag           | Default  | Description                                         |
|----------------|----------|-----------------------------------------------------|
| `--listen`     | `:9091`  | gRPC listen address                                 |
| `--collectors` | —        | Comma-separated `host:port` addresses of collectors |
| `--source`     | hostname | Name for this aggregator in query responses         |

`--collectors` is required; the aggregator exits immediately if it is not set.

### Example

```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source agg-prod
```

### Fault tolerance

The aggregator reconnects to each collector independently with exponential backoff (starting at
100 ms, doubling, capped at 30 s). After 3 consecutive failures to a collector it marks that
collector **degraded**: its last-known contribution is subtracted from the merged view so stale
counts do not accumulate. When the collector recovers and sends a new snapshot, it is automatically
reintegrated. The remaining collectors continue serving queries throughout.

### Memory

The aggregator's merged cache uses the same tiered ring-buffer structure as the collector
(60 × 1-min fine, 288 × 5-min coarse) but holds at most the top 50 000 entries per fine bucket
and the top 5 000 per coarse bucket across all collectors combined. Memory footprint is roughly
the same as one collector (~845 MB worst case).

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail aggregator
After=network.target

[Service]
ExecStart=/usr/local/bin/aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

---

## Frontend

HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
Zero JavaScript — server-rendered HTML with inline SVG sparklines.

### Flags

| Flag        | Default          | Description                                      |
|-------------|------------------|--------------------------------------------------|
| `--listen`  | `:8080`          | HTTP listen address                              |
| `--target`  | `localhost:9091` | Default gRPC endpoint (aggregator or collector)  |
| `--n`       | `25`             | Default number of table rows                     |
| `--refresh` | `30`             | Auto-refresh interval in seconds; `0` to disable |

### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.

**Window tabs** — switch between `1m / 5m / 15m / 60m / 6h / 24h`. Only the window changes;
all active filters are preserved.

**Dimension tabs** — switch between grouping by `website / prefix / uri / status`.

**Drilldown** — click any table row to add that value as a filter and advance to the next
dimension in the hierarchy:

```
website → client prefix → request URI → HTTP status → website (cycles)
```

Example: click `example.com` in the website view to see which client prefixes are hitting it;
click a prefix there to see which URIs it is requesting; and so on.

**Breadcrumb strip** — shows all active filters above the table. Click `×` next to any token
to remove just that filter, keeping the others.

**Sparkline** — inline SVG trend chart showing total request count per time bucket for the
current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.

**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
`f_prefix`, `f_uri`, `f_status`, `n`). Copy the URL to share an exact view with another
operator, or bookmark a recurring query.

**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
HTML. Useful for scripting without the CLI binary:

```bash
curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'
```

**Target override** — append `?target=host:port` to point the frontend at a different gRPC
endpoint for that request (useful for comparing a single collector against the aggregator):

```
http://frontend:8080/?target=nginx3:9090&w=5m
```

---

## CLI

A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint —
collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result.
Default output is a human-readable table; add `--json` for machine-readable NDJSON.

### Subcommands

```
logtail-cli topn   [flags]    ranked label → count table
logtail-cli trend  [flags]    per-bucket time series
logtail-cli stream [flags]    live snapshot feed (runs until Ctrl-C)
```

### Shared flags (all subcommands)

| Flag        | Default          | Description                                              |
|-------------|------------------|----------------------------------------------------------|
| `--target`  | `localhost:9090` | Comma-separated `host:port` list; queries fan out to all |
| `--json`    | false            | Emit newline-delimited JSON instead of a table           |
| `--website` | —                | Filter to this website                                   |
| `--prefix`  | —                | Filter to this client prefix                             |
| `--uri`     | —                | Filter to this request URI                               |
| `--status`  | —                | Filter to this HTTP status code (integer)                |

### `topn` flags

| Flag         | Default   | Description                       |
|--------------|-----------|-----------------------------------|
| `--n`        | `10`      | Number of entries                 |
| `--window`   | `5m`      | `1m` `5m` `15m` `60m` `6h` `24h`  |
| `--group-by` | `website` | `website` `prefix` `uri` `status` |

### `trend` flags

| Flag       | Default | Description                      |
|------------|---------|----------------------------------|
| `--window` | `5m`    | `1m` `5m` `15m` `60m` `6h` `24h` |

### Output format

**Table** (default; a single target is shown without a section banner):

```
RANK   COUNT    LABEL
1      18 432   example.com
2       4 211   other.com
```

**Multi-target** — each target gets a labeled section:

```
=== col-1 (nginx1:9090) ===
RANK   COUNT    LABEL
1      10 000   example.com

=== agg-prod (agg:9091) ===
RANK   COUNT    LABEL
1      18 432   example.com
```

**JSON** (`--json`) — one object per target, suitable for `jq`:

```json
{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}
```

**`stream` JSON** — one object per snapshot received (NDJSON), runs until interrupted:

```json
{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
```

### Examples

```bash
# Top 20 client prefixes sending 429s right now
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20

# Same query, piped to jq for scripting
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
    --json | jq '.entries[0]'

# Which website has the most 503s over the last 24h?
logtail-cli topn --target agg:9091 --window 24h --group-by website --status 503

# Drill down: top URIs on one website over the last 60 minutes
logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com

# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m

# Query both a collector and the aggregator at once
logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix

# Trend of total traffic over 6h (for a quick sparkline in the terminal)
logtail-cli trend --target agg:9091 --window 6h --json | jq '[.points[] | .count]'

# Watch live merged snapshots from the aggregator
logtail-cli stream --target agg:9091

# Watch two collectors simultaneously; each snapshot is labeled by source
logtail-cli stream --target nginx1:9090,nginx2:9090
```

The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until
interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.

---

## Operational notes

**No persistence.** All data is in memory. A collector restart loses ring-buffer history but
resumes tailing the log file from the current position immediately.

**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a
TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.

**inotify limits.** The collector uses a single inotify instance regardless of how many files it
tails. If you tail files across many different directories, check
`/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:

```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```
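
That change does not survive a reboot. To make it persistent, drop the setting into `/etc/sysctl.d/` (the file name here is a suggestion, not a project convention):

```bash
echo 'fs.inotify.max_user_watches = 65536' | sudo tee /etc/sysctl.d/90-nginx-logtail.conf
sudo sysctl --system
```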

**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with
unique URIs, the live map will hit its 100 000-entry cap and drop new keys for the rest of that
minute. The top-K entries already tracked continue accumulating counts. This is by design — the
cap prevents memory exhaustion under attack conditions.

**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have
significant clock skew, trend buckets from different collectors may not align precisely in the
aggregator. NTP sync is recommended.