# nginx-logtail User Guide

## Overview

nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Components:

| Binary        | Runs on          | Role                                               |
|---------------|------------------|----------------------------------------------------|
| `collector`   | each nginx host  | Tails log files, aggregates in memory, serves gRPC |
| `aggregator`  | central host     | Merges all collectors, serves unified gRPC         |
| `frontend`    | central host     | HTTP dashboard with drilldown UI                   |
| `cli`         | operator laptop  | Shell queries against collector or aggregator      |

---

## nginx Configuration

Add the `logtail` log format to your `nginx.conf` and apply it to each `server` block:

```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```

The format is tab-separated with fixed field positions. Query strings are stripped from the URI
by the collector at ingest time — only the path is tracked.
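
As a sketch of that ingest step (not the collector's actual code — just the documented behavior, with field order taken from the `log_format` above):

```go
package main

import (
	"fmt"
	"strings"
)

// Field order from the logtail log_format above:
// host, remote_addr, msec, method, uri, status, bytes, request_time.
const numFields = 8

// stripQuery drops everything from the first '?' onward; per the guide,
// only the path is tracked.
func stripQuery(uri string) string {
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		return uri[:i]
	}
	return uri
}

// parseLine splits one logtail line on tabs and normalizes the URI field.
// Lines with the wrong field count are rejected.
func parseLine(line string) ([]string, bool) {
	fields := strings.Split(line, "\t")
	if len(fields) != numFields {
		return nil, false
	}
	fields[4] = stripQuery(fields[4])
	return fields, true
}

func main() {
	line := "www.example.com\t192.0.2.7\t1700000000.123\tGET\t/search?q=hi\t200\t512\t0.034"
	if fields, ok := parseLine(line); ok {
		fmt.Println(fields[0], fields[4]) // → www.example.com /search
	}
}
```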
---

## Building

```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```

Requires Go 1.21+. No CGO, no external runtime dependencies.

---

## Collector

Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time
windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).

### Flags

| Flag           | Default      | Description                                                |
|----------------|--------------|------------------------------------------------------------|
| `--listen`     | `:9090`      | gRPC listen address                                        |
| `--logs`       | —            | Comma-separated log file paths or glob patterns            |
| `--logs-file`  | —            | File containing one log path/glob per line                 |
| `--source`     | hostname     | Name for this collector in query responses                 |
| `--v4prefix`   | `24`         | IPv4 prefix length for client bucketing (lower it, e.g. to `23`, for coarser buckets) |
| `--v6prefix`   | `48`         | IPv6 prefix length for client bucketing                    |

At least one of `--logs` or `--logs-file` is required.

### Examples

```bash
# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```

### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```

### Log rotation

The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.

### Memory usage

The collector is designed to stay well under 1 GB:

| Structure                   | Max entries | Approx size |
|-----------------------------|-------------|-------------|
| Live map (current minute)   | 100 000     | ~19 MB      |
| Fine ring (60 × 1-min)      | 60 × 50 000 | ~558 MB     |
| Coarse ring (288 × 5-min)   | 288 × 5 000 | ~268 MB     |
| **Total**                   |             | **~845 MB** |

When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that
minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
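
The cap logic described above can be sketched as follows (the `key` fields are illustrative stand-ins for the guide's 4-tuple, not the real types):

```go
package main

import "fmt"

// key is a stand-in for the 4-tuple the collector counts:
// website, client prefix, URI, status.
type key struct {
	website, prefix, uri string
	status               int
}

const maxLiveEntries = 100000 // per-minute cap from the table above

// liveMap admits new keys only while under the cap; keys already present
// always keep counting, matching the documented behavior.
type liveMap struct {
	counts map[key]uint64
}

func newLiveMap() *liveMap {
	return &liveMap{counts: make(map[key]uint64)}
}

// add increments k, creating it only if the cap has not been reached.
// It reports whether the event was recorded.
func (m *liveMap) add(k key) bool {
	if _, ok := m.counts[k]; !ok && len(m.counts) >= maxLiveEntries {
		return false // new key dropped for the rest of this minute
	}
	m.counts[k]++
	return true
}

// rotate hands off the finished minute and starts fresh, resetting the cap.
func (m *liveMap) rotate() map[key]uint64 {
	old := m.counts
	m.counts = make(map[key]uint64)
	return old
}

func main() {
	m := newLiveMap()
	k := key{"www.example.com", "192.0.2.0/24", "/", 200}
	m.add(k)
	m.add(k)
	fmt.Println(m.counts[k]) // → 2
}
```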

### Time windows

Data is served from two tiered ring buffers:

| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min  | fine        | 1 minute   |
| 5 min  | fine        | 1 minute   |
| 15 min | fine        | 1 minute   |
| 60 min | fine        | 1 minute   |
| 6 h    | coarse      | 5 minutes  |
| 24 h   | coarse      | 5 minutes  |

History is lost on restart — the collector resumes tailing immediately but all ring buffers start
empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.
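
A sketch of how a query window maps onto a ring and a bucket count, mirroring the table above (the bucket counts are derived from window ÷ resolution; the names are illustrative, not the real API):

```go
package main

import (
	"fmt"
	"time"
)

// windowSpec maps a query window to its source ring and bucket size.
type windowSpec struct {
	ring       string        // "fine" (1-min buckets) or "coarse" (5-min buckets)
	resolution time.Duration // bucket width
	buckets    int           // ring slots the window spans
}

// specFor resolves a window name: the fine ring serves everything up to
// 60m; the coarse ring serves 6h and 24h.
func specFor(window string) (windowSpec, bool) {
	switch window {
	case "1m":
		return windowSpec{"fine", time.Minute, 1}, true
	case "5m":
		return windowSpec{"fine", time.Minute, 5}, true
	case "15m":
		return windowSpec{"fine", time.Minute, 15}, true
	case "60m":
		return windowSpec{"fine", time.Minute, 60}, true
	case "6h":
		return windowSpec{"coarse", 5 * time.Minute, 72}, true
	case "24h":
		return windowSpec{"coarse", 5 * time.Minute, 288}, true
	}
	return windowSpec{}, false
}

func main() {
	s, _ := specFor("6h")
	fmt.Println(s.ring, s.buckets) // → coarse 72
}
```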

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail collector
After=network.target

[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
---

## Aggregator

Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots
into a unified view, and serves the same gRPC interface as the collector.

### Flags

| Flag           | Default   | Description                                            |
|----------------|-----------|--------------------------------------------------------|
| `--listen`     | `:9091`   | gRPC listen address                                    |
| `--collectors` | —         | Comma-separated `host:port` addresses of collectors    |
| `--source`     | hostname  | Name for this aggregator in query responses            |

### Example

```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091
```

The aggregator tolerates collector failures — if one collector is unreachable, results from the
remaining collectors are returned with a warning. It reconnects automatically with backoff.
---

## Frontend

HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).

### Flags

| Flag        | Default          | Description                              |
|-------------|------------------|------------------------------------------|
| `--listen`  | `:8080`          | HTTP listen address                      |
| `--target`  | `localhost:9091` | gRPC address of aggregator or collector  |

### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.

**Filter controls:**
- Click any row to add that value as a filter (e.g. click a website to restrict to it)
- The filter breadcrumb at the top shows all active filters; click any token to remove it
- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
- The page auto-refreshes every 30 seconds

**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP
Status using the tabs at the top of the table.

**Sparkline:** the trend chart shows total request count per bucket for the selected window and
active filters. Useful for spotting sudden spikes.

**URL sharing:** all filter state is in the URL query string — copy the URL to share a specific
view with another operator.
---

## CLI

A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.

### Subcommands

```
cli topn   --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
cli trend  --target HOST:PORT [filters] [--window W] [--pretty]
cli stream --target HOST:PORT [--pretty]
```

### Common flags

| Flag          | Default          | Description                                              |
|---------------|------------------|----------------------------------------------------------|
| `--target`    | `localhost:9090` | gRPC address of collector or aggregator                  |
| `--by`        | `website`        | Dimension: `website` `prefix` `uri` `status`             |
| `--window`    | `5m`             | Window: `1m` `5m` `15m` `60m` `6h` `24h`                 |
| `--n`         | `10`             | Number of results                                        |
| `--website`   | —                | Filter to this website                                   |
| `--prefix`    | —                | Filter to this client prefix                             |
| `--uri`       | —                | Filter to this request URI                               |
| `--status`    | —                | Filter to this HTTP status code                          |
| `--pretty`    | false            | Pretty-print JSON                                        |

### Examples

```bash
# Top 20 client prefixes sending 429s right now
cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'

# Which website has the most 503s in the last 24h?
cli topn --target agg:9091 --window 24h --by website --status 503

# Trend of 429s on one site over 6h — pipe to a quick graph
cli trend --target agg:9091 --window 6h --website api.example.com \
    | jq '[.points[] | {t: .time, n: .count}]'

# Watch live snapshots from one collector; alert on large entry counts
cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'

# Query a single collector directly (bypass aggregator)
cli topn --target nginx1:9090 --window 5m --by prefix --pretty
```

The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted.
Exit code is non-zero on any gRPC error.
---

## Operational notes

**No persistence.** All data is in-memory. A collector restart loses ring buffer history but
resumes tailing the log file from the current position immediately.

**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a
TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.

**inotify limits.** The collector uses a single inotify instance regardless of how many files it
tails. If you tail files across many different directories, check
`/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:

```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```

**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with
unique URIs, the live map will hit its 100 000 entry cap and drop new keys for the rest of that
minute. The top-K entries already tracked continue accumulating counts. This is by design — the
cap prevents memory exhaustion under attack conditions.

**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have
significant clock skew, trend buckets from different collectors may not align precisely in the
aggregator. NTP sync is recommended.
|