Update docs with collector, aggregator, CLI and frontend

2026-03-14 20:45:34 +01:00
parent 4369e66dee
commit c092561af2


@@ -2,7 +2,7 @@
## Overview
nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:
- Which client prefix is causing the most HTTP 429s right now?
@@ -166,8 +166,10 @@ WantedBy=multi-user.target
## Aggregator
Runs on a central machine. Subscribes to the `StreamSnapshots` push stream from every configured
collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
interface as the collector. The frontend and CLI query the aggregator exactly as they would query
a single collector.
### Flags
@@ -177,100 +179,216 @@ into a unified view, and serves the same gRPC interface as the collector.
| `--collectors` | — | Comma-separated `host:port` addresses of collectors |
| `--source` | hostname | Name for this aggregator in query responses |
`--collectors` is required; the aggregator exits immediately if it is not set.
### Example
```bash
./aggregator \
  --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
  --listen :9091 \
  --source agg-prod
```
### Fault tolerance
The aggregator reconnects to each collector independently with exponential backoff (100 ms →
doubles → cap 30 s). After 3 consecutive failures to a collector it marks that collector
**degraded**: its last-known contribution is subtracted from the merged view so stale counts
do not accumulate. When the collector recovers and sends a new snapshot, it is automatically
reintegrated. The remaining collectors continue serving queries throughout.
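For orientation, the delay schedule that policy produces can be sketched as follows (a simple
illustration of the doubling-with-cap numbers above, not the actual implementation, which may
also add jitter):

```bash
# Sketch of the documented backoff schedule: start at 100 ms, double on each
# failed attempt, cap at 30 s. Prints the delay used before each reconnect.
delay_ms=100
for attempt in $(seq 1 10); do
  echo "attempt ${attempt}: wait ${delay_ms} ms"
  delay_ms=$(( delay_ms * 2 ))
  if [ "$delay_ms" -gt 30000 ]; then delay_ms=30000; fi
done
```

The cap is reached on the tenth attempt; from then on the aggregator retries every 30 s until
the collector answers again.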
### Memory
The aggregator's merged cache uses the same tiered ring-buffer structure as the collector
(60 × 1-min fine, 288 × 5-min coarse) but holds at most top-50 000 entries per fine bucket
and top-5 000 per coarse bucket across all collectors combined. Memory footprint is roughly
the same as one collector (~845 MB worst case).
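As a quick sanity check, the two tiers cover these time spans (simple arithmetic on the bucket
counts above):

```bash
# Time span covered by each tier of the merged cache, from the figures above.
echo "fine tier:   $(( 60 * 1 )) min  (60 buckets x 1 min)"
echo "coarse tier: $(( 288 * 5 / 60 )) h    (288 buckets x 5 min)"
```

which lines up with the largest selectable query window (`24h`).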
### Systemd unit example
```ini
[Unit]
Description=nginx-logtail aggregator
After=network.target
[Service]
ExecStart=/usr/local/bin/aggregator \
  --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
  --listen :9091 \
  --source %H
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
```
---
## Frontend
HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
Zero JavaScript — server-rendered HTML with inline SVG sparklines.
### Flags
| Flag | Default | Description |
|-------------|-------------------|--------------------------------------------------|
| `--listen` | `:8080` | HTTP listen address |
| `--target` | `localhost:9091` | Default gRPC endpoint (aggregator or collector) |
| `--n` | `25` | Default number of table rows |
| `--refresh` | `30` | Auto-refresh interval in seconds; `0` to disable |
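### Example
A typical invocation might look like this (the `./frontend` binary name and values here are
illustrative; substitute your install path):
```bash
./frontend \
  --listen :8080 \
  --target agg:9091 \
  --n 50 \
  --refresh 10
```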
### Usage
Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.
**Window tabs** — switch between `1m / 5m / 15m / 60m / 6h / 24h`. Only the window changes;
all active filters are preserved.
**Dimension tabs** — switch between grouping by `website / prefix / uri / status`.
**Drilldown** — click any table row to add that value as a filter and advance to the next
dimension in the hierarchy:
```
website → client prefix → request URI → HTTP status → website (cycles)
```
Example: click `example.com` in the website view to see which client prefixes are hitting it;
click a prefix there to see which URIs it is requesting; and so on.
**Breadcrumb strip** — shows all active filters above the table. Click `×` next to any token
to remove just that filter, keeping the others.
**Sparkline** — inline SVG trend chart showing total request count per time bucket for the
current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.
**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
`f_prefix`, `f_uri`, `f_status`, `n`). Copy the URL to share an exact view with another
operator, or bookmark a recurring query.
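A bookmarked query can therefore be assembled by hand; for example (hypothetical values,
parameter names from the list above):

```bash
# Assemble a shareable dashboard URL from the documented query parameters:
# 15-minute window, grouped by URI, filtered to one website and status 429.
base='http://frontend:8080/'
query='w=15m&by=uri&f_website=example.com&f_status=429&n=50'
echo "${base}?${query}"
```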
**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
HTML. Useful for scripting without the CLI binary:
```bash
curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'
```
**Target override** — append `?target=host:port` to point the frontend at a different gRPC
endpoint for that request (useful for comparing a single collector against the aggregator):
```
http://frontend:8080/?target=nginx3:9090&w=5m
```
---
## CLI
A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint —
collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result.
Default output is a human-readable table; add `--json` for machine-readable NDJSON.
### Subcommands
```
logtail-cli topn   [flags]    ranked label → count table
logtail-cli trend  [flags]    per-bucket time series
logtail-cli stream [flags]    live snapshot feed (runs until Ctrl-C)
```
### Shared flags (all subcommands)
| Flag | Default | Description |
|---------------|------------------|----------------------------------------------------------|
| `--target` | `localhost:9090` | Comma-separated `host:port` list; queries fan out to all |
| `--json` | false | Emit newline-delimited JSON instead of a table |
| `--website` | — | Filter to this website |
| `--prefix` | — | Filter to this client prefix |
| `--uri` | — | Filter to this request URI |
| `--status` | — | Filter to this HTTP status code (integer) |
### `topn` flags
| Flag | Default | Description |
|---------------|------------|----------------------------------------------------------|
| `--n` | `10` | Number of entries |
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
| `--group-by` | `website` | `website` `prefix` `uri` `status` |
### `trend` flags
| Flag | Default | Description |
|---------------|------------|----------------------------------------------------------|
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
### Output format
**Table** (default — single target, no per-target section label):
```
RANK COUNT LABEL
1 18 432 example.com
2 4 211 other.com
```
**Multi-target** — each target gets a labeled section:
```
=== col-1 (nginx1:9090) ===
RANK COUNT LABEL
1 10 000 example.com
=== agg-prod (agg:9091) ===
RANK COUNT LABEL
1 18 432 example.com
```
**JSON** (`--json`) — one object per target, suitable for `jq`:
```json
{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}
```
**`stream` JSON** — one object per snapshot received (NDJSON), runs until interrupted:
```json
{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
```
### Examples
```bash
# Top 20 client prefixes sending 429s right now
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20
# Same query, pipe to jq for scripting
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
  --json | jq '.entries[0]'
# Which website has the most 503s over the last 24h?
logtail-cli topn --target agg:9091 --window 24h --group-by website --status 503
# Drill: top URIs on one website over the last 60 minutes
logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com
# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m
# Query both a collector and the aggregator at once
logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix
# Trend of total traffic over 6h (for a quick sparkline in the terminal)
logtail-cli trend --target agg:9091 --window 6h --json | jq '[.points[] | .count]'
# Watch live merged snapshots from the aggregator
logtail-cli stream --target agg:9091
# Watch two collectors simultaneously; each snapshot is labeled by source
logtail-cli stream --target nginx1:9090,nginx2:9090
``` ```
The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until
interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.
---