Update docs with collector, aggregator, CLI and frontend

2026-03-14 20:45:34 +01:00
parent 4369e66dee
commit c092561af2


@@ -2,7 +2,7 @@
## Overview
nginx-logtail is a three-component system for real-time traffic analysis across a cluster of nginx
nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:
- Which client prefix is causing the most HTTP 429s right now?
@@ -166,8 +166,10 @@ WantedBy=multi-user.target
## Aggregator
Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots
into a unified view, and serves the same gRPC interface as the collector.
Runs on a central machine. Subscribes to the `StreamSnapshots` push stream from every configured
collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
interface as the collector. The frontend and CLI query the aggregator exactly as they would query
a single collector.
### Flags
@@ -177,100 +179,216 @@ into a unified view, and serves the same gRPC interface as the collector.
| `--collectors` | — | Comma-separated `host:port` addresses of collectors |
| `--source` | hostname | Name for this aggregator in query responses |
`--collectors` is required; the aggregator exits immediately if it is not set.
### Example
```bash
./aggregator \
--collectors nginx1:9090,nginx2:9090,nginx3:9090 \
--listen :9091
--listen :9091 \
--source agg-prod
```
The aggregator tolerates collector failures — if one collector is unreachable, results from the
remaining collectors are returned with a warning. It reconnects automatically with backoff.
### Fault tolerance
The aggregator reconnects to each collector independently with exponential backoff: the delay
starts at 100 ms, doubles after each failed attempt, and is capped at 30 s. After 3 consecutive
failures to a collector, the aggregator marks that collector **degraded** and subtracts its
last-known contribution from the merged view so that stale counts do not accumulate. When the
collector recovers and sends a new snapshot, it is reintegrated automatically. The remaining
collectors continue serving queries throughout.
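
The backoff schedule and the degraded threshold described above can be sketched as follows. This is a minimal illustration, not the aggregator's actual code: the `CollectorState` class and its method names are hypothetical, and it assumes the first retry waits the full 100 ms.

```python
from dataclasses import dataclass

INITIAL_DELAY_S = 0.1   # 100 ms starting delay, from the text above
MAX_DELAY_S = 30.0      # backoff cap, from the text above
DEGRADED_AFTER = 3      # consecutive failures before "degraded", from the text above


def backoff_delay(failures: int) -> float:
    """Seconds to wait before the next attempt, after `failures` (>= 1) consecutive failures."""
    return min(INITIAL_DELAY_S * 2 ** (failures - 1), MAX_DELAY_S)


@dataclass
class CollectorState:
    """Hypothetical per-collector bookkeeping; names are illustrative only."""
    failures: int = 0
    degraded: bool = False

    def on_failure(self) -> float:
        """Record a failed attempt and return the delay before the next one."""
        self.failures += 1
        if self.failures >= DEGRADED_AFTER:
            self.degraded = True
        return backoff_delay(self.failures)

    def on_snapshot(self) -> None:
        """A fresh snapshot arrived: clear the failure count and the degraded flag."""
        self.failures = 0
        self.degraded = False
```

Under these assumptions the delay reaches the 30 s cap at the tenth consecutive failure, well after the collector has been marked degraded.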
### Memory
The aggregator's merged cache uses the same tiered ring-buffer structure as the collector
(60 × 1-min fine buckets covering 1 h, 288 × 5-min coarse buckets covering 24 h). It keeps at
most the top 50 000 entries per fine bucket and the top 5 000 per coarse bucket across all
collectors combined, so its memory footprint is roughly the same as one collector (~845 MB
worst case).
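
To make the per-bucket cap concrete, here is a rough sketch of merging one bucket's worth of per-collector counts and truncating the result to the top K. The function name, the dictionary layout, and the source names are assumptions for illustration; the real data structures are not shown in this document.

```python
from __future__ import annotations

from collections import Counter

FINE_TOP_K = 50_000    # cap per fine bucket, from the text above
COARSE_TOP_K = 5_000   # cap per coarse bucket, from the text above


def merge_bucket(per_collector: dict[str, dict[str, int]], top_k: int) -> dict[str, int]:
    """Sum label counts across collectors, then keep only the top_k largest."""
    merged: Counter[str] = Counter()
    for counts in per_collector.values():
        merged.update(counts)  # adds counts label by label
    return dict(merged.most_common(top_k))


# Illustrative input: two collectors reporting counts for the same bucket.
bucket = {
    "nginx1": {"example.com": 10, "other.com": 3},
    "nginx2": {"example.com": 5, "third.com": 4},
}
top2 = merge_bucket(bucket, top_k=2)
```

In this sketch, leaving a degraded collector out of `per_collector` and merging again has the same effect as subtracting its last-known contribution.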
### Systemd unit example
```ini
[Unit]
Description=nginx-logtail aggregator
After=network.target
[Service]
ExecStart=/usr/local/bin/aggregator \
--collectors nginx1:9090,nginx2:9090,nginx3:9090 \
--listen :9091 \
--source %H
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
```
---
## Frontend
HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
Zero JavaScript — server-rendered HTML with inline SVG sparklines.
### Flags
| Flag | Default | Description |
|-------------|--------------|---------------------------------------|
| `--listen` | `:8080` | HTTP listen address |
| `--target` | `localhost:9091` | gRPC address of aggregator or collector |
| Flag | Default | Description |
|-------------|-------------------|--------------------------------------------------|
| `--listen` | `:8080` | HTTP listen address |
| `--target` | `localhost:9091` | Default gRPC endpoint (aggregator or collector) |
| `--n` | `25` | Default number of table rows |
| `--refresh` | `30` | Auto-refresh interval in seconds; `0` to disable |
### Usage
Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.
**Filter controls:**
- Click any row to add that value as a filter (e.g. click a website to restrict to it)
- The filter breadcrumb at the top shows all active filters; click any token to remove it
- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
- The page auto-refreshes every 30 seconds
**Window tabs** — switch between `1m / 5m / 15m / 60m / 6h / 24h`. Only the window changes;
all active filters are preserved.
**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP
Status using the tabs at the top of the table.
**Dimension tabs** switch between grouping by `website / prefix / uri / status`.
**Sparkline:** the trend chart shows total request count per bucket for the selected window and
active filters. Useful for spotting sudden spikes.
**Drilldown** — click any table row to add that value as a filter and advance to the next
dimension in the hierarchy:
**URL sharing:** all filter state is in the URL query string — copy the URL to share a specific
view with another operator.
```
website → client prefix → request URI → HTTP status → website (cycles)
```
Example: click `example.com` in the website view to see which client prefixes are hitting it;
click a prefix there to see which URIs it is requesting; and so on.
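
The drilldown cycle can be modelled as a small state transition. The sketch below is only an illustration of the hierarchy described above; the helper names `next_dimension` and `drill` are hypothetical, and the dimension keys follow the `website / prefix / uri / status` names used by the dimension tabs.

```python
from __future__ import annotations

# Order of the drilldown hierarchy; after "status" it wraps back to "website".
DRILL_ORDER = ["website", "prefix", "uri", "status"]


def next_dimension(current: str) -> str:
    """Return the dimension that follows `current` in the drilldown cycle."""
    i = DRILL_ORDER.index(current)
    return DRILL_ORDER[(i + 1) % len(DRILL_ORDER)]


def drill(filters: dict[str, str], by: str, clicked: str) -> tuple[dict[str, str], str]:
    """Clicking a row adds a filter on the current dimension and advances the grouping."""
    return {**filters, by: clicked}, next_dimension(by)


# Clicking example.com in the website view, then a prefix in the prefix view.
filters, by = drill({}, "website", "example.com")
filters, by = drill(filters, by, "10.0.0.0/8")
```

After these two clicks the view is grouped by `uri` with both the website and the prefix filter active.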
**Breadcrumb strip** — shows all active filters above the table. Click `×` next to any token
to remove just that filter, keeping the others.
**Sparkline** — inline SVG trend chart showing total request count per time bucket for the
current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.
**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
`f_prefix`, `f_uri`, `f_status`, `n`). Copy the URL to share an exact view with another
operator, or bookmark a recurring query.
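
Because the whole view is encoded in the query string, a shareable link can be generated from a script. The following sketch uses only the parameter names listed above; the helper `dashboard_url` and the base URL are made up for illustration, and the exact encoding the frontend expects is an assumption.

```python
from urllib.parse import urlencode

FILTER_DIMENSIONS = ("website", "prefix", "uri", "status")


def dashboard_url(base: str, window: str, by: str, n: int = 25, **filters) -> str:
    """Build a dashboard URL from a window, a grouping dimension, and optional filters."""
    params = {"w": window, "by": by, "n": n}
    for dim in FILTER_DIMENSIONS:
        if dim in filters:
            params[f"f_{dim}"] = filters[dim]  # e.g. status=429 becomes f_status=429
    return f"{base}/?{urlencode(params)}"


# Hypothetical host name; top client prefixes sending 429s in the last minute.
url = dashboard_url("http://frontend:8080", "1m", "prefix", status=429)
```

`urlencode` percent-encodes reserved characters, so a prefix filter such as `10.0.0.0/8` is emitted as `10.0.0.0%2F8`.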
**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
HTML. Useful for scripting without the CLI binary:
```bash
curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'
```
**Target override** — append `?target=host:port` to point the frontend at a different gRPC
endpoint for that request (useful for comparing a single collector against the aggregator):
```bash
curl -s 'http://frontend:8080/?target=nginx3:9090&w=5m'
```
---
## CLI
A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.
A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint —
collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result.
Default output is a human-readable table; add `--json` for machine-readable NDJSON.
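
The fan-out behaviour can be pictured roughly as below. This is not the CLI's implementation: `query` stands in for whatever gRPC call a subcommand makes, and both helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_targets(flag_value: str) -> list:
    """Split a comma-separated --target value into host:port strings."""
    return [t.strip() for t in flag_value.split(",") if t.strip()]


def fan_out(targets, query):
    """Run query(target) for every target concurrently; return (target, result) pairs in order."""
    with ThreadPoolExecutor(max_workers=max(len(targets), 1)) as pool:
        results = list(pool.map(query, targets))  # map preserves input order
    return list(zip(targets, results))


# Stand-in for a real query; it simply echoes the target it was given.
def fake_query(target):
    return {"target": target, "entries": []}


labeled = fan_out(parse_targets("nginx1:9090,agg:9091"), fake_query)
```

Keeping the results paired with their targets is what allows each section of the output to be labeled.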
### Subcommands
```
cli topn --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
cli trend --target HOST:PORT [filters] [--window W] [--pretty]
cli stream --target HOST:PORT [--pretty]
logtail-cli topn [flags] ranked label → count table
logtail-cli trend [flags] per-bucket time series
logtail-cli stream [flags] live snapshot feed (runs until Ctrl-C)
```
### Common flags
### Shared flags (all subcommands)
| Flag | Default | Description |
|---------------|------------------|----------------------------------------------------------|
| `--target` | `localhost:9090` | gRPC address of collector or aggregator |
| `--by` | `website` | Dimension: `website` `prefix` `uri` `status` |
| `--window` | `5m` | Window: `1m` `5m` `15m` `60m` `6h` `24h` |
| `--n` | `10` | Number of results |
| `--target` | `localhost:9090` | Comma-separated `host:port` list; queries fan out to all |
| `--json` | false | Emit newline-delimited JSON instead of a table |
| `--website` | — | Filter to this website |
| `--prefix` | — | Filter to this client prefix |
| `--uri` | — | Filter to this request URI |
| `--status` | — | Filter to this HTTP status code |
| `--pretty` | false | Pretty-print JSON |
| `--status` | — | Filter to this HTTP status code (integer) |
### `topn` flags
| Flag | Default | Description |
|---------------|------------|----------------------------------------------------------|
| `--n` | `10` | Number of entries |
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
| `--group-by` | `website` | `website` `prefix` `uri` `status` |
### `trend` flags
| Flag | Default | Description |
|---------------|------------|----------------------------------------------------------|
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
### Output format
**Table** (default; with a single target there is no `=== source ===` section header):
```
RANK COUNT LABEL
1 18 432 example.com
2 4 211 other.com
```
**Multi-target** — each target gets a labeled section:
```
=== col-1 (nginx1:9090) ===
RANK COUNT LABEL
1 10 000 example.com
=== agg-prod (agg:9091) ===
RANK COUNT LABEL
1 18 432 example.com
```
**JSON** (`--json`) — one object per target, suitable for `jq`:
```json
{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}
```
**`stream` JSON** — one object per snapshot received (NDJSON), runs until interrupted:
```json
{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
```
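
Since `--json` output is newline-delimited, each line can be decoded on its own. The sketch below parses the `stream` sample shown above and filters on `total_entries`; the threshold and the helper name `parse_ndjson` are arbitrary choices for illustration.

```python
import json


def parse_ndjson(text: str) -> list:
    """Decode newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# The stream sample from above, as it would arrive on stdout.
sample = (
    '{"ts":1773516180,"source":"col-1","target":"nginx1:9090",'
    '"total_entries":823,"top_label":"example.com","top_count":10000}\n'
)

snapshots = parse_ndjson(sample)
# Keep only snapshots with more than 500 entries (arbitrary threshold).
large = [s for s in snapshots if s["total_entries"] > 500]
```

In a real pipeline the text would come from the CLI's standard output, read line by line rather than all at once.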
### Examples
```bash
# Top 20 client prefixes sending 429s right now
cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20
# Which website has the most 503s in the last 24h?
cli topn --target agg:9091 --window 24h --by website --status 503
# Same query, pipe to jq for scripting
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
--json | jq '.entries[0]'
# Trend of 429s on one site over 6h — pipe to a quick graph
cli trend --target agg:9091 --window 6h --website api.example.com \
| jq '[.points[] | {t: .time, n: .count}]'
# Which website has the most 503s over the last 24h?
logtail-cli topn --target agg:9091 --window 24h --group-by website --status 503
# Watch live snapshots from one collector; alert on large entry counts
cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'
# Drill: top URIs on one website over the last 60 minutes
logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com
# Query a single collector directly (bypass aggregator)
cli topn --target nginx1:9090 --window 5m --by prefix --pretty
# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m
# Query both a collector and the aggregator at once
logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix
# Trend of total traffic over 6h (for a quick sparkline in the terminal)
logtail-cli trend --target agg:9091 --window 6h --json | jq '[.points[] | .count]'
# Watch live merged snapshots from the aggregator
logtail-cli stream --target agg:9091
# Watch two collectors simultaneously; each snapshot is labeled by source
logtail-cli stream --target nginx1:9090,nginx2:9090
```
The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted.
Exit code is non-zero on any gRPC error.
The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until
interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.
---