Collector implementation

commit 6ca296b2e8 (parent 4393ae2726)
2026-03-14 20:07:22 +01:00
16 changed files with 3052 additions and 0 deletions

docs/USERGUIDE.md (new file, 299 lines)
# nginx-logtail User Guide
## Overview
nginx-logtail is a three-daemon system (collector, aggregator, frontend), plus a companion CLI, for
real-time traffic analysis across a cluster of nginx machines. It answers questions like:
- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?
Components:
| Binary | Runs on | Role |
|---------------|------------------|----------------------------------------------------|
| `collector` | each nginx host | Tails log files, aggregates in memory, serves gRPC |
| `aggregator` | central host | Merges all collectors, serves unified gRPC |
| `frontend` | central host | HTTP dashboard with drilldown UI |
| `cli` | operator laptop | Shell queries against collector or aggregator |
---
## nginx Configuration
Add the `logtail` log format to your `nginx.conf` and apply it to each `server` block:
```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```
The format is tab-separated with fixed field positions. Query strings are stripped from the URI
by the collector at ingest time — only the path is tracked.
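Because the fields are positional, any tab-aware tool can parse a line. The sketch below mirrors the ingest step, including the query-string strip; the sample values are made up and the collector's internals may differ:

```bash
# A sample line in the logtail format (tabs written as \t, expanded by printf %b):
line='www.example.com\t203.0.113.77\t1710442042.123\tGET\t/api/v1/items?page=2\t429\t512\t0.004'

# Split on tabs; strip the query string from the URI as the collector does.
printf '%b\n' "$line" | awk -F'\t' '{
    uri = $5
    sub(/\?.*/, "", uri)    # keep only the path
    printf "host=%s client=%s status=%s uri=%s\n", $1, $2, $6, uri
}'
# → host=www.example.com client=203.0.113.77 status=429 uri=/api/v1/items
```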
---
## Building
```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```
Requires Go 1.21+. No CGO, no external runtime dependencies.
---
## Collector
Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time
windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).
### Flags
| Flag | Default | Description |
|----------------|--------------|-----------------------------------------------------------|
| `--listen` | `:9090` | gRPC listen address |
| `--logs` | — | Comma-separated log file paths or glob patterns |
| `--logs-file` | — | File containing one log path/glob per line |
| `--source` | hostname | Name for this collector in query responses |
| `--v4prefix` | `24` | IPv4 prefix length for client bucketing (e.g. `23` groups clients by /23) |
| `--v6prefix` | `48` | IPv6 prefix length for client bucketing |
At least one of `--logs` or `--logs-file` is required.
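Prefix bucketing groups nearby clients under one counter key. For the default /24 this amounts to zeroing the host octet; a sketch of the idea (illustrative only — the collector's exact key format is internal):

```bash
# Illustrative /24 bucketing: zero the last octet of an IPv4 address.
echo "203.0.113.77" | awk -F. '{printf "%s.%s.%s.0/24\n", $1, $2, $3}'
# → 203.0.113.0/24
```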
### Examples
```bash
# Single file
./collector --logs /var/log/nginx/access.log
# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"
# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf
# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```
### logs-file format
One path or glob pattern per line. Lines starting with `#` are ignored.
```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```
### Log rotation
The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.
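Because the collector drains the old descriptor itself, no `postrotate` signal is needed for it. A minimal logrotate stanza (paths illustrative; nginx itself still needs its usual reopen signal) is simply:

```
# /etc/logrotate.d/nginx  (illustrative)
/var/log/nginx/*.log {
    daily
    rotate 14
    missingok
    compress
    delaycompress
    # nginx needs its reopen signal; the collector needs nothing here.
    postrotate
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}
```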
### Memory usage
The collector is designed to stay well under 1 GB:
| Structure | Max entries | Approx size |
|-----------------------------|-------------|-------------|
| Live map (current minute) | 100 000 | ~19 MB |
| Fine ring (60 × 1-min) | 60 × 50 000 | ~558 MB |
| Coarse ring (288 × 5-min) | 288 × 5 000 | ~268 MB |
| **Total** | | **~845 MB** |
When the live map reaches 100 000 distinct 4-tuples (website, client prefix, URI, status), new keys
are dropped for the rest of that minute. Existing keys continue to accumulate counts. The cap resets
at each minute rotation.
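The table is consistent with roughly 186 bytes per tracked entry — an inferred figure, not from the source. With that assumption the totals follow directly (decimal MB, within rounding of the table's ~845 MB):

```bash
# Assumed ~186 bytes/entry; reproduces the memory table's rows.
awk 'BEGIN {
    e = 186                      # bytes per entry (assumption)
    live   = 100000 * e          # live map, current minute
    fine   = 60 * 50000 * e      # fine ring, 60 x 1-min buckets
    coarse = 288 * 5000 * e      # coarse ring, 288 x 5-min buckets
    printf "live=%.0fMB fine=%.0fMB coarse=%.0fMB total=%.0fMB\n",
           live/1e6, fine/1e6, coarse/1e6, (live+fine+coarse)/1e6
}'
# → live=19MB fine=558MB coarse=268MB total=844MB
```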
### Time windows
Data is served from two tiered ring buffers:
| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min | fine | 1 minute |
| 5 min | fine | 1 minute |
| 15 min | fine | 1 minute |
| 60 min | fine | 1 minute |
| 6 h | coarse | 5 minutes |
| 24 h | coarse | 5 minutes |
History is lost on restart — the collector resumes tailing immediately but all ring buffers start
empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.
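Wider windows on a ring are served by merging the newest N buckets (the 6 h and 24 h windows do the same over the coarse ring). Conceptually — illustrative only, not the real per-key merge:

```bash
# Stand-in counts for 60 one-minute fine-ring buckets (1..60);
# a 15-minute window sums the newest 15 of them.
seq 1 60 | tail -n 15 | awk '{s += $1} END {print s}'
# → 795
```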
### Systemd unit example
```ini
[Unit]
Description=nginx-logtail collector
After=network.target
[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
```
---
## Aggregator
Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots
into a unified view, and serves the same gRPC interface as the collector.
### Flags
| Flag | Default | Description |
|----------------|-----------|--------------------------------------------------------|
| `--listen` | `:9091` | gRPC listen address |
| `--collectors` | — | Comma-separated `host:port` addresses of collectors |
| `--source` | hostname | Name for this aggregator in query responses |
### Example
```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091
```
The aggregator tolerates collector failures — if one collector is unreachable, results from the
remaining collectors are returned with a warning. It reconnects automatically with backoff.
---
## Frontend
HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
### Flags
| Flag        | Default          | Description                             |
|-------------|------------------|-----------------------------------------|
| `--listen`  | `:8080`          | HTTP listen address                     |
| `--target`  | `localhost:9091` | gRPC address of aggregator or collector |
### Usage
Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.
**Filter controls:**
- Click any row to add that value as a filter (e.g. click a website to restrict to it)
- The filter breadcrumb at the top shows all active filters; click any token to remove it
- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
- The page auto-refreshes every 30 seconds
**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP
Status using the tabs at the top of the table.
**Sparkline:** the trend chart shows total request count per bucket for the selected window and
active filters. Useful for spotting sudden spikes.
**URL sharing:** all filter state is in the URL query string — copy the URL to share a specific
view with another operator.
---
## CLI
A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.
### Subcommands
```
cli topn --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
cli trend --target HOST:PORT [filters] [--window W] [--pretty]
cli stream --target HOST:PORT [--pretty]
```
### Common flags
| Flag | Default | Description |
|---------------|------------------|----------------------------------------------------------|
| `--target` | `localhost:9090` | gRPC address of collector or aggregator |
| `--by` | `website` | Dimension: `website` `prefix` `uri` `status` |
| `--window` | `5m` | Window: `1m` `5m` `15m` `60m` `6h` `24h` |
| `--n` | `10` | Number of results |
| `--website` | — | Filter to this website |
| `--prefix` | — | Filter to this client prefix |
| `--uri` | — | Filter to this request URI |
| `--status` | — | Filter to this HTTP status code |
| `--pretty` | false | Pretty-print JSON |
### Examples
```bash
# Top 20 client prefixes sending 429s right now
cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'
# Which website has the most 503s in the last 24h?
cli topn --target agg:9091 --window 24h --by website --status 503
# Trend of 429s on one site over 6h — pipe to a quick graph
cli trend --target agg:9091 --window 6h --website api.example.com \
    | jq '[.points[] | {t: .time, n: .count}]'
# Watch live snapshots from one collector; alert on large entry counts
cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'
# Query a single collector directly (bypass aggregator)
cli topn --target nginx1:9090 --window 5m --by prefix --pretty
```
The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted.
Exit code is non-zero on any gRPC error.
---
## Operational notes
**No persistence.** All data is in-memory. A collector restart loses ring buffer history but
resumes tailing the log file from the current position immediately.
**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a
TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.
**inotify limits.** The collector uses a single inotify instance regardless of how many files it
tails. If you tail files across many different directories, check
`/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:
```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```
**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with
unique URIs, the live map will hit its 100 000 entry cap and drop new keys for the rest of that
minute. The top-K entries already tracked continue accumulating counts. This is by design — the
cap prevents memory exhaustion under attack conditions.
**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have
significant clock skew, trend buckets from different collectors may not align precisely in the
aggregator. NTP sync is recommended.