diff --git a/docs/USERGUIDE.md b/docs/USERGUIDE.md
index 0cdb720..78118ef 100644
--- a/docs/USERGUIDE.md
+++ b/docs/USERGUIDE.md
@@ -2,7 +2,7 @@
 
 ## Overview
 
-nginx-logtail is a three-component system for real-time traffic analysis across a cluster of nginx
+nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
 machines. It answers questions like:
 
 - Which client prefix is causing the most HTTP 429s right now?
@@ -166,8 +166,10 @@ WantedBy=multi-user.target
 
 ## Aggregator
 
-Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots
-into a unified view, and serves the same gRPC interface as the collector.
+Runs on a central machine. Subscribes to the `StreamSnapshots` push stream from every configured
+collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
+interface as the collector. The frontend and CLI query the aggregator exactly as they would query
+a single collector.
 
 ### Flags
 
@@ -177,100 +179,216 @@ into a unified view, and serves the same gRPC interface as the collector.
 | `--collectors` | — | Comma-separated `host:port` addresses of collectors |
 | `--source` | hostname | Name for this aggregator in query responses |
 
+`--collectors` is required; the aggregator exits immediately if it is not set.
+
 ### Example
 
 ```bash
 ./aggregator \
   --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
-  --listen :9091
+  --listen :9091 \
+  --source agg-prod
 ```
 
-The aggregator tolerates collector failures — if one collector is unreachable, results from the
-remaining collectors are returned with a warning. It reconnects automatically with backoff.
+### Fault tolerance
+
+The aggregator reconnects to each collector independently with exponential backoff (100 ms →
+doubles → cap 30 s). After 3 consecutive failures to a collector it marks that collector
+**degraded**: its last-known contribution is subtracted from the merged view so stale counts
+do not accumulate. When the collector recovers and sends a new snapshot, it is automatically
+reintegrated. The remaining collectors continue serving queries throughout.
+
+### Memory
+
+The aggregator's merged cache uses the same tiered ring-buffer structure as the collector
+(60 × 1-min fine, 288 × 5-min coarse) but holds at most top-50 000 entries per fine bucket
+and top-5 000 per coarse bucket across all collectors combined. Memory footprint is roughly
+the same as one collector (~845 MB worst case).
+
+### Systemd unit example
+
+```ini
+[Unit]
+Description=nginx-logtail aggregator
+After=network.target
+
+[Service]
+ExecStart=/usr/local/bin/aggregator \
+    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
+    --listen :9091 \
+    --source %H
+Restart=on-failure
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
+```
 
 ---
 
 ## Frontend
 
 HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
+Zero JavaScript — server-rendered HTML with inline SVG sparklines.
 
 ### Flags
 
-| Flag | Default | Description |
-|-------------|--------------|---------------------------------------|
-| `--listen` | `:8080` | HTTP listen address |
-| `--target` | `localhost:9091` | gRPC address of aggregator or collector |
+| Flag | Default | Description |
+|-------------|-------------------|--------------------------------------------------|
+| `--listen` | `:8080` | HTTP listen address |
+| `--target` | `localhost:9091` | Default gRPC endpoint (aggregator or collector) |
+| `--n` | `25` | Default number of table rows |
+| `--refresh` | `30` | Auto-refresh interval in seconds; `0` to disable |
 
 ### Usage
 
 Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
 the selected dimension and time window.
 
-**Filter controls:**
-- Click any row to add that value as a filter (e.g. click a website to restrict to it)
-- The filter breadcrumb at the top shows all active filters; click any token to remove it
-- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
-- The page auto-refreshes every 30 seconds
+**Window tabs** — switch between `1m / 5m / 15m / 60m / 6h / 24h`. Only the window changes;
+all active filters are preserved.
 
-**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP
-Status using the tabs at the top of the table.
+**Dimension tabs** — switch between grouping by `website / prefix / uri / status`.
 
-**Sparkline:** the trend chart shows total request count per bucket for the selected window and
-active filters. Useful for spotting sudden spikes.
+**Drilldown** — click any table row to add that value as a filter and advance to the next
+dimension in the hierarchy:
 
-**URL sharing:** all filter state is in the URL query string — copy the URL to share a specific
-view with another operator.
+```
+website → client prefix → request URI → HTTP status → website (cycles)
+```
+
+Example: click `example.com` in the website view to see which client prefixes are hitting it;
+click a prefix there to see which URIs it is requesting; and so on.
+
+**Breadcrumb strip** — shows all active filters above the table. Click `×` next to any token
+to remove just that filter, keeping the others.
+
+**Sparkline** — inline SVG trend chart showing total request count per time bucket for the
+current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.
+
+**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
+`f_prefix`, `f_uri`, `f_status`, `n`). Copy the URL to share an exact view with another
+operator, or bookmark a recurring query.
+
+**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
+HTML. Useful for scripting without the CLI binary:
+
+```bash
+curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'
+```
+
+**Target override** — append `?target=host:port` to point the frontend at a different gRPC
+endpoint for that request (useful for comparing a single collector against the aggregator):
+
+```bash
+http://frontend:8080/?target=nginx3:9090&w=5m
+```
 
 ---
 
 ## CLI
 
-A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.
+A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint —
+collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result.
+Default output is a human-readable table; add `--json` for machine-readable NDJSON.
 
 ### Subcommands
 
 ```
-cli topn --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
-cli trend --target HOST:PORT [filters] [--window W] [--pretty]
-cli stream --target HOST:PORT [--pretty]
+logtail-cli topn   [flags]   ranked label → count table
+logtail-cli trend  [flags]   per-bucket time series
+logtail-cli stream [flags]   live snapshot feed (runs until Ctrl-C)
 ```
 
-### Common flags
+### Shared flags (all subcommands)
 
 | Flag | Default | Description |
 |---------------|------------------|----------------------------------------------------------|
-| `--target` | `localhost:9090` | gRPC address of collector or aggregator |
-| `--by` | `website` | Dimension: `website` `prefix` `uri` `status` |
-| `--window` | `5m` | Window: `1m` `5m` `15m` `60m` `6h` `24h` |
-| `--n` | `10` | Number of results |
+| `--target` | `localhost:9090` | Comma-separated `host:port` list; queries fan out to all |
+| `--json` | false | Emit newline-delimited JSON instead of a table |
 | `--website` | — | Filter to this website |
 | `--prefix` | — | Filter to this client prefix |
 | `--uri` | — | Filter to this request URI |
-| `--status` | — | Filter to this HTTP status code |
-| `--pretty` | false | Pretty-print JSON |
+| `--status` | — | Filter to this HTTP status code (integer) |
+
+### `topn` flags
+
+| Flag | Default | Description |
+|---------------|------------|----------------------------------------------------------|
+| `--n` | `10` | Number of entries |
+| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
+| `--group-by` | `website` | `website` `prefix` `uri` `status` |
+
+### `trend` flags
+
+| Flag | Default | Description |
+|---------------|------------|----------------------------------------------------------|
+| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
+
+### Output format
+
+**Table** (default — single target, no header):
+```
+RANK  COUNT    LABEL
+   1  18 432   example.com
+   2   4 211   other.com
+```
+
+**Multi-target** — each target gets a labeled section:
+```
+=== col-1 (nginx1:9090) ===
+RANK  COUNT    LABEL
+   1  10 000   example.com
+
+=== agg-prod (agg:9091) ===
+RANK  COUNT    LABEL
+   1  18 432   example.com
+```
+
+**JSON** (`--json`) — one object per target, suitable for `jq`:
+```json
+{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}
+```
+
+**`stream` JSON** — one object per snapshot received (NDJSON), runs until interrupted:
+```json
+{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
+```
 
 ### Examples
 
 ```bash
 # Top 20 client prefixes sending 429s right now
-cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'
+logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20
 
-# Which website has the most 503s in the last 24h?
-cli topn --target agg:9091 --window 24h --by website --status 503
+# Same query, pipe to jq for scripting
+logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
+  --json | jq '.entries[0]'
 
-# Trend of 429s on one site over 6h — pipe to a quick graph
-cli trend --target agg:9091 --window 6h --website api.example.com \
-  | jq '[.points[] | {t: .time, n: .count}]'
+# Which website has the most 503s over the last 24h?
+logtail-cli topn --target agg:9091 --window 24h --group-by website --status 503
 
-# Watch live snapshots from one collector; alert on large entry counts
-cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'
+# Drill: top URIs on one website over the last 60 minutes
+logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com
 
-# Query a single collector directly (bypass aggregator)
-cli topn --target nginx1:9090 --window 5m --by prefix --pretty
+# Compare two collectors side by side in one command
+logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m
+
+# Query both a collector and the aggregator at once
+logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix
+
+# Trend of total traffic over 6h (for a quick sparkline in the terminal)
+logtail-cli trend --target agg:9091 --window 6h --json | jq '[.points[] | .count]'
+
+# Watch live merged snapshots from the aggregator
+logtail-cli stream --target agg:9091
+
+# Watch two collectors simultaneously; each snapshot is labeled by source
+logtail-cli stream --target nginx1:9090,nginx2:9090
 ```
 
-The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted.
-Exit code is non-zero on any gRPC error.
+The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until
+interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.
 
 ---
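
The reconnect schedule described under "Fault tolerance" (100 ms initial delay, doubling, capped at 30 s) can be sketched in a few lines of shell. This only illustrates the arithmetic and is not taken from the aggregator's code. The helper name `backoff_delay_ms` is hypothetical, and the sketch assumes that the first retry waits the full 100 ms and that no jitter is applied; the guide states neither.

```bash
#!/bin/sh
# Hypothetical helper, not part of nginx-logtail: prints the delay in
# milliseconds before reconnect attempt N (0-based), assuming a 100 ms
# base that doubles on every failure and is capped at 30 000 ms.
backoff_delay_ms() {
  attempt=$1
  delay=100
  i=0
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 30000 ]; then
      delay=30000   # cap reached; further doubling changes nothing
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Print the schedule for a few consecutive-failure counts.
for n in 0 1 2 3 8 9 12; do
  printf 'attempt %2d -> %5d ms\n' "$n" "$(backoff_delay_ms "$n")"
done
```

Under these assumptions the delay first hits the 30 s cap on the tenth consecutive failure (attempt index 9), so a collector that stays down is retried about twice a minute from then on.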