# nginx-logtail User Guide

## Overview
nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx machines. It answers questions like:
- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?
Components:
| Binary | Runs on | Role |
|---|---|---|
| `collector` | each nginx host | Tails log files, aggregates in memory, serves gRPC |
| `aggregator` | central host | Merges all collectors, serves unified gRPC |
| `frontend` | central host | HTTP dashboard with drilldown UI |
| `cli` | operator laptop | Shell queries against collector or aggregator |
## nginx Configuration
Add the logtail log format to your nginx.conf and apply it to each server block:
```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```
The format is tab-separated with fixed field positions. Query strings are stripped from the URI by the collector at ingest time — only the path is tracked.
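To make the fixed field positions concrete, here is a minimal parsing sketch in Go. The field order follows the `log_format` directive above; `parseLine` and `Entry` are illustrative names, not the collector's actual identifiers:

```go
// Sketch: parse one logtail-format line. Field order matches the
// log_format directive; parseLine and Entry are illustrative names.
package main

import (
	"fmt"
	"strings"
)

type Entry struct {
	Host, Addr, Method, URI, Status string
}

func parseLine(line string) (Entry, error) {
	f := strings.Split(line, "\t")
	if len(f) != 8 {
		return Entry{}, fmt.Errorf("want 8 fields, got %d", len(f))
	}
	uri := f[4]
	// Strip the query string at ingest time: only the path is tracked.
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i]
	}
	return Entry{Host: f[0], Addr: f[1], Method: f[3], URI: uri, Status: f[5]}, nil
}

func main() {
	e, _ := parseLine("example.com\t192.0.2.7\t1773516180.123\tGET\t/api/v1?x=1\t429\t512\t0.012")
	fmt.Printf("%+v\n", e) // URI comes out as /api/v1, query string dropped
}
```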
## Building
```sh
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```
Requires Go 1.21+. No CGO, no external runtime dependencies.
## Collector
Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).
### Flags
| Flag | Default | Description |
|---|---|---|
| `--listen` | `:9090` | gRPC listen address |
| `--logs` | — | Comma-separated log file paths or glob patterns |
| `--logs-file` | — | File containing one log path/glob per line |
| `--source` | hostname | Name for this collector in query responses |
| `--v4prefix` | `24` | IPv4 prefix length for client bucketing, e.g. 24 groups clients by /24 (see the sketch below) |
| `--v6prefix` | `48` | IPv6 prefix length for client bucketing |
At least one of `--logs` or `--logs-file` is required.
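The prefix flags control how client addresses collapse into counter keys. A minimal sketch of that bucketing using the standard library's `net/netip`; `bucketFor` is an illustrative name, not the collector's actual API:

```go
// Sketch: mask a client address down to the configured prefix length,
// so all clients in e.g. the same /24 share one counter key.
package main

import (
	"fmt"
	"net/netip"
)

func bucketFor(addr string, v4bits, v6bits int) (netip.Prefix, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return netip.Prefix{}, err
	}
	bits := v6bits
	if ip.Is4() {
		bits = v4bits
	}
	// Prefix zeroes the host bits: 192.0.2.77 -> 192.0.2.0/24
	return ip.Prefix(bits)
}

func main() {
	p, _ := bucketFor("192.0.2.77", 24, 48)
	fmt.Println(p) // 192.0.2.0/24
	p, _ = bucketFor("2001:db8:cafe:beef::1", 24, 48)
	fmt.Println(p) // 2001:db8:cafe::/48
}
```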
### Examples

```sh
# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```
### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```
### Log rotation
The collector handles logrotate automatically. On RENAME/REMOVE events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.
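A sketch of that drain-then-reopen sequence, assuming the widely used `github.com/fsnotify/fsnotify` package; the collector's actual tailer may differ (it also handles partial lines and many files per watcher):

```go
// Sketch: on RENAME/REMOVE, drain the old fd to EOF, then reopen the
// original path with backoff until logrotate creates the new file.
package main

import (
	"bufio"
	"io"
	"log"
	"os"
	"time"

	"github.com/fsnotify/fsnotify"
)

func drain(f *os.File, emit func(string)) {
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		emit(sc.Text())
	}
}

func tail(path string, emit func(string)) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	if err := w.Add(path); err != nil {
		return err
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	f.Seek(0, io.SeekEnd) // only new lines matter
	for ev := range w.Events {
		switch {
		case ev.Op&fsnotify.Write != 0:
			drain(f, emit)
		case ev.Op&(fsnotify.Rename|fsnotify.Remove) != 0:
			drain(f, emit) // the old fd is still readable: no lines are lost
			f.Close()
			for { // retry until the new file appears
				if nf, err := os.Open(path); err == nil {
					f = nf
					w.Add(path) // re-arm the watch on the new inode
					break
				}
				time.Sleep(500 * time.Millisecond)
			}
		}
	}
	return nil
}

func main() {
	err := tail("/var/log/nginx/access.log", func(l string) { log.Println(l) })
	if err != nil {
		log.Fatal(err)
	}
}
```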
### Memory usage
The collector is designed to stay well under 1 GB:
| Structure | Max entries | Approx size |
|---|---|---|
| Live map (current minute) | 100 000 | ~19 MB |
| Fine ring (60 × 1-min) | 60 × 50 000 | ~558 MB |
| Coarse ring (288 × 5-min) | 288 × 5 000 | ~268 MB |
| **Total** | | ~845 MB |
When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
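A sketch of that cap logic at ingest time; `liveMap`, `ingest`, and `rotate` are illustrative names:

```go
// Sketch: once the live map holds 100 000 distinct keys, new keys are
// dropped until the next minute rotation; existing keys keep counting.
package main

import "fmt"

const maxLiveKeys = 100_000

// key is the 4-tuple tracked per request.
type key struct{ website, prefix, uri, status string }

type liveMap struct {
	counts map[key]uint64
}

func (m *liveMap) ingest(k key) {
	if _, ok := m.counts[k]; !ok && len(m.counts) >= maxLiveKeys {
		return // cap reached: drop new keys for the rest of the minute
	}
	m.counts[k]++
}

// rotate is called at each minute boundary: the full map is handed to
// the fine ring (top-50 000 kept) and the live map starts empty again.
func (m *liveMap) rotate() map[key]uint64 {
	done := m.counts
	m.counts = make(map[key]uint64, 1024)
	return done
}

func main() {
	m := &liveMap{counts: make(map[key]uint64)}
	m.ingest(key{"example.com", "192.0.2.0/24", "/api", "429"})
	fmt.Println(len(m.counts)) // 1
}
```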
### Time windows
Data is served from two tiered ring buffers:
| Window | Source ring | Resolution |
|---|---|---|
| 1 min | fine | 1 minute |
| 5 min | fine | 1 minute |
| 15 min | fine | 1 minute |
| 60 min | fine | 1 minute |
| 6 h | coarse | 5 minutes |
| 24 h | coarse | 5 minutes |
History is lost on restart — the collector resumes tailing immediately but all ring buffers start empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.
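A sketch of how a requested window maps onto the two rings, following the table above; `bucketsFor` is a hypothetical helper, not the collector's actual API:

```go
// Sketch: short windows merge 1-minute fine buckets, long windows merge
// 5-minute coarse buckets.
package main

import (
	"fmt"
	"time"
)

func bucketsFor(window time.Duration) (ring string, n int) {
	if window <= time.Hour {
		return "fine", int(window / time.Minute) // 1m..60m -> 1..60 buckets
	}
	return "coarse", int(window / (5 * time.Minute)) // 6h -> 72, 24h -> 288
}

func main() {
	for _, w := range []time.Duration{time.Minute, 15 * time.Minute, 6 * time.Hour, 24 * time.Hour} {
		r, n := bucketsFor(w)
		fmt.Printf("%-6s -> merge %d %s buckets\n", w, n, r)
	}
}
```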
### Systemd unit example

```ini
[Unit]
Description=nginx-logtail collector
After=network.target

[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
## Aggregator
Runs on a central machine. Subscribes to the StreamSnapshots push stream from every configured
collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
interface as the collector. The frontend and CLI query the aggregator exactly as they would query
a single collector.
### Flags

| Flag | Default | Description |
|---|---|---|
| `--listen` | `:9091` | gRPC listen address |
| `--collectors` | — | Comma-separated host:port addresses of collectors |
| `--source` | hostname | Name for this aggregator in query responses |

`--collectors` is required; the aggregator exits immediately if it is not set.
### Example

```sh
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source agg-prod
```
### Fault tolerance
The aggregator reconnects to each collector independently with exponential backoff (100 ms → doubles → cap 30 s). After 3 consecutive failures to a collector it marks that collector degraded: its last-known contribution is subtracted from the merged view so stale counts do not accumulate. When the collector recovers and sends a new snapshot, it is automatically reintegrated. The remaining collectors continue serving queries throughout.
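A sketch of that per-collector reconnect loop under the stated parameters (100 ms initial backoff doubling to a 30 s cap, degraded after 3 consecutive failures); the names and the `connect` hook are illustrative:

```go
// Sketch: reconnect with exponential backoff and mark the collector
// degraded after 3 consecutive failures. connect stands in for opening
// the StreamSnapshots stream and blocks while it is healthy.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

func watchCollector(ctx context.Context, addr string, connect func(string) error) {
	backoff := 100 * time.Millisecond
	failures := 0
	for ctx.Err() == nil {
		if err := connect(addr); err == nil {
			failures = 0
			backoff = 100 * time.Millisecond // healthy again: reset state
			continue
		}
		failures++
		if failures == 3 {
			// Here the real aggregator subtracts this collector's
			// last-known contribution from the merged view.
			log.Printf("%s degraded after %d failures", addr, failures)
		}
		time.Sleep(backoff)
		if backoff *= 2; backoff > 30*time.Second {
			backoff = 30 * time.Second
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	watchCollector(ctx, "nginx1:9090", func(string) error {
		return errors.New("stream dropped") // stand-in failure
	})
}
```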
### Memory
The aggregator's merged cache uses the same tiered ring-buffer structure as the collector (60 × 1-min fine, 288 × 5-min coarse) but holds at most top-50 000 entries per fine bucket and top-5 000 per coarse bucket across all collectors combined. Memory footprint is roughly the same as one collector (~845 MB worst case).
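A sketch of the merge step implied above: sum counts for identical keys across collectors, then truncate to the top K; `mergeTopK` is an illustrative name:

```go
// Sketch: merge per-collector buckets into one top-K bucket
// (K = 50 000 for fine buckets, 5 000 for coarse buckets).
package main

import (
	"fmt"
	"sort"
)

func mergeTopK(buckets []map[string]uint64, k int) map[string]uint64 {
	sum := make(map[string]uint64)
	for _, b := range buckets {
		for key, n := range b {
			sum[key] += n // identical keys from different collectors add up
		}
	}
	if len(sum) <= k {
		return sum
	}
	keys := make([]string, 0, len(sum))
	for key := range sum {
		keys = append(keys, key)
	}
	sort.Slice(keys, func(i, j int) bool { return sum[keys[i]] > sum[keys[j]] })
	out := make(map[string]uint64, k)
	for _, key := range keys[:k] {
		out[key] = sum[key]
	}
	return out
}

func main() {
	a := map[string]uint64{"example.com": 10000, "other.com": 4211}
	b := map[string]uint64{"example.com": 8432}
	fmt.Println(mergeTopK([]map[string]uint64{a, b}, 1)) // map[example.com:18432]
}
```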
### Systemd unit example

```ini
[Unit]
Description=nginx-logtail aggregator
After=network.target

[Service]
ExecStart=/usr/local/bin/aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
## Frontend
HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging). Zero JavaScript — server-rendered HTML with inline SVG sparklines.
### Flags

| Flag | Default | Description |
|---|---|---|
| `--listen` | `:8080` | HTTP listen address |
| `--target` | `localhost:9091` | Default gRPC endpoint (aggregator or collector) |
| `--n` | `25` | Default number of table rows |
| `--refresh` | `30` | Auto-refresh interval in seconds; `0` to disable |
### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for the selected dimension and time window.

- **Window tabs** — switch between 1m / 5m / 15m / 60m / 6h / 24h. Only the window changes; all active filters are preserved.
- **Dimension tabs** — switch between grouping by website / prefix / uri / status.
- **Drilldown** — click any table row to add that value as a filter and advance to the next dimension in the hierarchy: website → client prefix → request URI → HTTP status → website (cycles). For example, click example.com in the website view to see which client prefixes are hitting it; click a prefix there to see which URIs it is requesting; and so on.
- **Breadcrumb strip** — shows all active filters above the table. Click × next to any token to remove just that filter, keeping the others.
- **Sparkline** — inline SVG trend chart showing total request count per time bucket for the current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.
- **URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`, `f_prefix`, `f_uri`, `f_status`, `n`). Copy the URL to share an exact view with another operator, or bookmark a recurring query.
- **JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of HTML. Useful for scripting without the CLI binary:

  ```sh
  curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'
  ```

- **Target override** — append `?target=host:port` to point the frontend at a different gRPC endpoint for that request (useful for comparing a single collector against the aggregator):

  ```
  http://frontend:8080/?target=nginx3:9090&w=5m
  ```
## CLI

A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint — collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result. Default output is a human-readable table; add `--json` for machine-readable NDJSON.
### Subcommands

```
logtail-cli topn   [flags]   ranked label → count table
logtail-cli trend  [flags]   per-bucket time series
logtail-cli stream [flags]   live snapshot feed (runs until Ctrl-C)
```
### Shared flags (all subcommands)

| Flag | Default | Description |
|---|---|---|
| `--target` | `localhost:9090` | Comma-separated host:port list; queries fan out to all |
| `--json` | `false` | Emit newline-delimited JSON instead of a table |
| `--website` | — | Filter to this website |
| `--prefix` | — | Filter to this client prefix |
| `--uri` | — | Filter to this request URI |
| `--status` | — | Filter to this HTTP status code (integer) |
### topn flags

| Flag | Default | Description |
|---|---|---|
| `--n` | `10` | Number of entries |
| `--window` | `5m` | One of `1m`, `5m`, `15m`, `60m`, `6h`, `24h` |
| `--group-by` | `website` | One of `website`, `prefix`, `uri`, `status` |
### trend flags

| Flag | Default | Description |
|---|---|---|
| `--window` | `5m` | One of `1m`, `5m`, `15m`, `60m`, `6h`, `24h` |
### Output format

Table (default; a single target prints no section header):

```
RANK  COUNT   LABEL
   1  18 432  example.com
   2   4 211  other.com
```

Multi-target — each target gets a labeled section:

```
=== col-1 (nginx1:9090) ===
RANK  COUNT   LABEL
   1  10 000  example.com

=== agg-prod (agg:9091) ===
RANK  COUNT   LABEL
   1  18 432  example.com
```

JSON (`--json`) — one object per target, suitable for jq:

```
{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}
```

`stream` JSON — one object per snapshot received (NDJSON), runs until interrupted:

```
{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
```
### Examples

```sh
# Top 20 client prefixes sending 429s right now
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20

# Same query, pipe to jq for scripting
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
    --json | jq '.entries[0]'

# Which website has the most 503s over the last 24h?
logtail-cli topn --target agg:9091 --window 24h --group-by website --status 503

# Drill: top URIs on one website over the last 60 minutes
logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com

# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m

# Query both a collector and the aggregator at once
logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix

# Trend of total traffic over 6h (for a quick sparkline in the terminal)
logtail-cli trend --target agg:9091 --window 6h --json | jq '[.points[] | .count]'

# Watch live merged snapshots from the aggregator
logtail-cli stream --target agg:9091

# Watch two collectors simultaneously; each snapshot is labeled by source
logtail-cli stream --target nginx1:9090,nginx2:9090
```
The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.
## Operational notes

- **No persistence.** All data is in-memory. A collector restart loses ring buffer history but resumes tailing the log file from the current position immediately.
- **No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.
- **inotify limits.** The collector uses a single inotify instance regardless of how many files it tails. If you tail files across many different directories, check `/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:

  ```sh
  echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
  ```

- **High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with unique URIs, the live map will hit its 100 000-entry cap and drop new keys for the rest of that minute. The top-K entries already tracked continue accumulating counts. This is by design; the cap prevents memory exhaustion under attack conditions.
- **Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have significant clock skew, trend buckets from different collectors may not align precisely in the aggregator. NTP sync is recommended.