# nginx-logtail User Guide

## Overview

nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Components:

| Binary | Runs on | Role |
|---------------|------------------|----------------------------------------------------|
| `collector` | each nginx host | Tails log files and/or UDP datagrams, aggregates in memory, serves gRPC |
| `aggregator` | central host | Merges all collectors, serves unified gRPC |
| `frontend` | central host | HTTP dashboard with drilldown UI |
| `cli` | operator laptop | Shell queries against collector or aggregator |

Every binary accepts `-version` (or `nginx-logtail version` for the CLI) and prints its version,
git commit, and build date.

---
## Installation

Three flavors. `make help` lists every target; `make install-deps` sets up a fresh build box
(apt deps, Go toolchain, `protoc-gen-go`, `golangci-lint`).

### Debian package

```bash
make pkg-deb          # produces nginx-logtail_<ver>_{amd64,arm64}.deb
sudo dpkg -i nginx-logtail_*_amd64.deb
```

The package installs:

| Path | Contents |
|---------------------------------------------------------------|---------------------------------------------------|
| `/usr/sbin/nginx-logtail-{collector,aggregator,frontend}` | Service binaries |
| `/usr/bin/nginx-logtail` | CLI |
| `/lib/systemd/system/nginx-logtail-*.service` | Three systemd units |
| `/usr/share/man/man8/nginx-logtail.8.gz` | Manpage (`man 8 nginx-logtail`) |
| `/usr/share/nginx-logtail/default.template` | Defaults template |
| `/etc/default/nginx-logtail` | **Generated on first install** from the template |

The postinst creates a system user/group `_logtail` if absent and renders the template into
`/etc/default/nginx-logtail` with the short hostname substituted. **None of the services are
enabled or started automatically** — installing the package is safe on any host. Operators
opt in per service:

```bash
sudo systemctl enable --now nginx-logtail-collector.service    # on each nginx host
sudo systemctl enable --now nginx-logtail-aggregator.service   # on the central host
sudo systemctl enable --now nginx-logtail-frontend.service     # on the central host
```

The collector runs as `_logtail:www-data` so it can read nginx access logs that are
group-readable by `www-data`; aggregator and frontend run as `_logtail:_logtail`.
### Docker / Docker Compose

The repo's `docker-compose.yml` runs the aggregator and frontend together from a single image
that contains all four binaries.

```bash
make docker        # builds git.ipng.ch/ipng/nginx-logtail:v<ver> + :latest, native arch
make docker-push   # multi-arch (amd64+arm64) buildx push

AGGREGATOR_COLLECTORS=nginx1:9090,nginx2:9090 docker compose up -d
# frontend on :8080, aggregator gRPC on :9091
```

Each container explicitly selects its binary via `command: ["/usr/local/bin/<binary>"]`.

### From source

```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
make build         # -> build/<arch>/{collector,aggregator,frontend,cli}
make test
./build/*/cli version
```

Requires Go ≥ 1.24 (see `go.mod`). No CGO, no external runtime dependencies.

---
## Configuration

### /etc/default/nginx-logtail

The Debian package ships one shared environment file read by all three systemd units via
`EnvironmentFile=-/etc/default/nginx-logtail`. It enumerates every flag the three daemons
accept as a `COLLECTOR_*`, `AGGREGATOR_*`, or `FRONTEND_*` env var. Defaults on first install
are sensible for a single-host deployment:

| Variable | First-install default | Purpose |
|----------------------------|------------------------------|---------------------------------------------------|
| `COLLECTOR_LISTEN` | `:9090` | gRPC listen address |
| `COLLECTOR_PROM_LISTEN` | `:9100` | Prometheus metrics; set `""` to disable |
| `COLLECTOR_LOGS` | *(empty — UDP-only)* | Comma-sep log paths/globs |
| `COLLECTOR_LOGS_FILE` | *(empty)* | File with one path/glob per line |
| `COLLECTOR_SOURCE` | `$(hostname -s)` at install | Display name in query responses |
| `COLLECTOR_V4PREFIX` | `24` | IPv4 bucket prefix |
| `COLLECTOR_V6PREFIX` | `48` | IPv6 bucket prefix |
| `COLLECTOR_SCAN_INTERVAL` | `10s` | Log-glob rescan cadence |
| `COLLECTOR_LOGTAIL_PORT` | `9514` | UDP port for `ipng_stats_logtail` (0 disables) |
| `COLLECTOR_LOGTAIL_BIND` | `127.0.0.1` | UDP bind address |
| `AGGREGATOR_LISTEN` | `:9091` | gRPC listen address |
| `AGGREGATOR_COLLECTORS` | `localhost:9090` | Comma-sep collectors (mandatory) |
| `AGGREGATOR_SOURCE` | `$(hostname -s)` at install | Display name |
| `FRONTEND_LISTEN` | `:8080` | HTTP dashboard address |
| `FRONTEND_TARGET` | `localhost:9091` | Default gRPC endpoint |
| `FRONTEND_N` | `25` | Default table row count |
| `FRONTEND_REFRESH` | `30` | Meta-refresh seconds; `0` disables |

At least one of `COLLECTOR_LOGS`, `COLLECTOR_LOGS_FILE`, or `COLLECTOR_LOGTAIL_PORT > 0` must
be set, otherwise the collector refuses to start. The shipped default (`COLLECTOR_LOGS=` empty
plus `COLLECTOR_LOGTAIL_PORT=9514`) makes the collector UDP-only — no file tailer goroutine
is launched when no log patterns are supplied.

Three escape-hatch variables — `COLLECTOR_ARGS`, `AGGREGATOR_ARGS`, `FRONTEND_ARGS` — are
appended verbatim to each unit's `ExecStart` argv. Use them for flags without an env-var form,
or for temporary overrides, without editing the unit.

The file is **not a dpkg conffile**: postinst writes it only when absent, so operator edits
survive upgrades, and `dpkg --purge` removes it.
### nginx — file-based ingest

Add the `logtail` format and attach it to whichever `server` blocks you want tracked:

```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```

Tab-separated, fixed field order, ten fields. The precise layout:

| # | Field | Ingested into |
|---|-------------------|--------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated) |
| 2 | `$msec` | *(discarded)* |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |

`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`).

**If either is unknown, emit `0`.** A literal `0` in `$is_tor` parses as `false`; a literal
`0` in `$asn` parses as ASN `0`, which you can exclude at query time with `--asn '!=0'` / the
`asn!=0` filter expression. Operators who don't have TOR or GeoIP data can simply emit `0` for
both columns and everything works.

Both fields are also **positionally optional** for backward compatibility — older 8-field
lines are accepted and default to `false` / `0`. Records from the file tailer are always
tagged `source_tag="direct"`.

Then point the collector at the log files via `COLLECTOR_LOGS` — comma-separated paths or
glob patterns. Make sure the files are group-readable by `www-data` (the collector's primary
group in the systemd unit).
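The layout above is simple enough to parse with a few lines of Go. The sketch below is illustrative only — `Record` and `parseFileLine` are hypothetical names, not the collector's actual code — but it captures the rules stated above: tab-separated, query string stripped from the URI, and legacy 8-field lines defaulting `is_tor`/`asn` to `false`/`0`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Record holds the fields ingested from one logtail line (hypothetical type).
type Record struct {
	Website, ClientAddr, Method, URI string
	Status, ASN                      int
	BodyBytes                        int64
	Duration                         float64
	IsTor                            bool
}

// parseFileLine parses the tab-separated `logtail` format:
// host, remote_addr, msec, method, uri, status, body_bytes, request_time[, is_tor, asn].
// Older 8-field lines are accepted and default is_tor=false, asn=0.
func parseFileLine(line string) (Record, error) {
	f := strings.Split(line, "\t")
	if len(f) != 8 && len(f) != 10 {
		return Record{}, fmt.Errorf("want 8 or 10 fields, got %d", len(f))
	}
	status, err := strconv.Atoi(f[5])
	if err != nil {
		return Record{}, err
	}
	body, _ := strconv.ParseInt(f[6], 10, 64)
	dur, _ := strconv.ParseFloat(f[7], 64)
	r := Record{
		Website: f[0], ClientAddr: f[1], Method: f[3],
		URI:    strings.SplitN(f[4], "?", 2)[0], // query string stripped
		Status: status, BodyBytes: body, Duration: dur,
	}
	if len(f) == 10 { // optional trailing is_tor + asn columns
		r.IsTor = f[8] == "1"
		r.ASN, _ = strconv.Atoi(f[9])
	}
	return r, nil
}

func main() {
	line := "example.com\t192.0.2.7\t1700000000.123\tGET\t/api/v1?q=x\t200\t512\t0.034\t0\t8298"
	r, err := parseFileLine(line)
	fmt.Println(r.Website, r.URI, r.Status, r.ASN, err)
}
```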
### nginx — UDP ingest (`nginx-ipng-stats-plugin`)

If the nginx host runs [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin),
the plugin's `ipng_stats_logtail` directive emits UDP datagrams directly to the collector, no
log file involved. The wire format is **versioned** — every log line starts with a literal
`v1\t` prefix so the collector can ship new parser versions (v2, v3, …) before emitters are
upgraded and route each line accordingly.

```nginx
http {
    log_format ipng_stats_logtail
        'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';

    ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s;
}
```

Precise v1 layout — 13 tab-separated fields total (version prefix + 12 payload fields):

| # | Field | Ingested into |
|---|-------------------|------------------------------|
| 0 | `v1` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` |
| 9 | `$asn` | `asn` |
| 10| `$ipng_source_tag`| `source_tag` |
| 11| `$server_addr` | *(parsed and discarded)* |
| 12| `$scheme` | *(parsed and discarded)* |

Compared to the file format: the version tag is added, `$msec` is dropped, and three fields
are appended — `$ipng_source_tag` (propagated into the data model), `$server_addr` and
`$scheme` (reserved for future use).

**Unknown `$is_tor` / `$asn`: emit `0`.** Same convention as the file format — operators
without TOR or GeoIP data can emit `0` for both columns and everything works. A literal `0`
in `$is_tor` is `false`; a literal `0` in `$asn` is ASN `0`, filterable at query time.

Note that with `buffer=`/`flush=` set, the plugin batches many newline-separated log lines
into a single UDP datagram; the collector splits each datagram on `\n` and parses every line
independently. All 13 fields are required for v1 — malformed lines (wrong version, wrong
field count, bad IP) are silently dropped, and show up as `logtail_udp_loglines_success_total`
falling short of the expected lines per datagram (see the UDP ingest counters below). Both
paths (file + UDP) can feed the same collector simultaneously; they converge on the same
aggregation pipeline.
---
## Collector

Runs on each nginx machine. Ingests logs from files (via `fsnotify`) and/or UDP datagrams
(from `nginx-ipng-stats-plugin`), maintains in-memory top-K counters across six time
windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).

### Flags

| Flag | Default | Description |
|-------------------|---------------|-------------------------------------------------------------------|
| `--listen` | `:9090` | gRPC listen address |
| `--prom-listen` | `:9100` | Prometheus metrics address; empty string to disable |
| `--logs` | — | Comma-separated log file paths or glob patterns |
| `--logs-file` | — | File containing one log path/glob per line |
| `--source` | hostname | Name for this collector in query responses |
| `--v4prefix` | `24` | IPv4 prefix length for client bucketing |
| `--v6prefix` | `48` | IPv6 prefix length for client bucketing |
| `--scan-interval` | `10s` | How often to rescan glob patterns for new/removed files |
| `--logtail-port` | `0` (off) | UDP port receiving `ipng_stats_logtail` datagrams |
| `--logtail-bind` | `127.0.0.1` | UDP bind address |
| `--version` | — | Print version, commit, build date and exit |

At least one of `--logs`, `--logs-file`, or `--logtail-port > 0` is required; otherwise the
collector refuses to start.
### Examples

```bash
# UDP-only (nginx-ipng-stats-plugin feed)
./collector --logtail-port 9514

# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Files and UDP at the same time
./collector --logs "/var/log/nginx/*.log" --logtail-port 9514

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```

### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```
### Log rotation

The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.
### Prometheus metrics

The collector exposes a Prometheus-compatible `/metrics` endpoint on `--prom-listen` (default
`:9100`). Set `--prom-listen ""` to disable it entirely.

**Per-host series:**

- `nginx_http_requests_total{host, method, status}` — counter. Map capped at 250 000 distinct
  label sets; new entries beyond the cap are dropped until the map is rolled over.
- `nginx_http_response_body_bytes_{bucket,count,sum}{host, le}` — histogram of
  `$body_bytes_sent`. Buckets (bytes): `256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, le}` — histogram of
  `$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5,
  10, +Inf`. Not split by `source_tag` (duration histogram stays per-host to avoid cardinality
  blow-up).

**Per-`source_tag` roll-ups** (parallel series, not a cross-product with `host`):

- `nginx_http_requests_by_source_total{source_tag}` — counter.
- `nginx_http_response_body_bytes_by_source_{bucket,count,sum}{source_tag, le}` — histogram.

**UDP ingest counters** — lets operators distinguish parse failures from back-pressure drops:

- `logtail_udp_packets_received_total` — datagrams read off the socket.
- `logtail_udp_loglines_success_total` — log lines parsed OK.
- `logtail_udp_loglines_consumed_total` — log lines forwarded to the store (not dropped).

Note the unit mismatch: `packets_*` counts datagrams, `loglines_*` counts log lines.
The nginx plugin batches many log lines into a single UDP datagram (default `buffer=64k
flush=1s`), so `loglines_success ≫ packets_received` is normal — operators should see
roughly `loglines_success / packets_received ≈ avg lines per batch`.

`loglines_success - loglines_consumed` is the back-pressure drop rate (channel full).
A large gap between `packets_received * expected_lines_per_packet` and `loglines_success`
indicates parse failures.
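The two drop classes described above can be read directly in PromQL. A sketch, assuming a 5m rate window (adjust to your scrape interval):

```promql
# Average log lines per datagram — expect ≫ 1 while batching is on
rate(logtail_udp_loglines_success_total[5m])
  / rate(logtail_udp_packets_received_total[5m])

# Back-pressure drops per second (channel full)
rate(logtail_udp_loglines_success_total[5m])
  - rate(logtail_udp_loglines_consumed_total[5m])
```

A sudden fall in the first ratio (toward or below your configured lines-per-batch) points at parse failures; a persistent positive second expression points at back-pressure.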
**Prometheus scrape config:**

```yaml
scrape_configs:
  - job_name: nginx_logtail
    static_configs:
      - targets:
          - nginx1:9100
          - nginx2:9100
          - nginx3:9100
```

Or with service discovery — the collector has no special requirements beyond a reachable
TCP port.

**Example queries:**

```promql
# Request rate per host over last 5 minutes
rate(nginx_http_requests_total[5m])

# 5xx error rate fraction per host
sum by (host) (rate(nginx_http_requests_total{status=~"5.."}[5m]))
/
sum by (host) (rate(nginx_http_requests_total[5m]))

# 95th percentile response time per host
histogram_quantile(0.95,
  sum by (host, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)

# Median response body size per host
histogram_quantile(0.50,
  sum by (host, le) (rate(nginx_http_response_body_bytes_bucket[5m]))
)
```
### Memory usage

The collector is designed to stay well under 1 GB:

| Structure | Max entries | Approx size |
|-----------------------------|-------------|-------------|
| Live map (current minute) | 100 000 | ~19 MB |
| Fine ring (60 × 1-min) | 60 × 50 000 | ~558 MB |
| Coarse ring (288 × 5-min) | 288 × 5 000 | ~268 MB |
| **Total** | | **~845 MB** |

When the live map reaches 100 000 distinct 6-tuples, new keys are dropped for the rest of that
minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
### Time windows

Data is served from two tiered ring buffers:

| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min | fine | 1 minute |
| 5 min | fine | 1 minute |
| 15 min | fine | 1 minute |
| 60 min | fine | 1 minute |
| 6 h | coarse | 5 minutes |
| 24 h | coarse | 5 minutes |

History is lost on restart — the collector resumes tailing immediately but all ring buffers start
empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.
### Running under systemd

The Debian package ships `nginx-logtail-collector.service` ready to run under the `_logtail`
system user with `Group=www-data` (for log-file access). Every flag comes from
`/etc/default/nginx-logtail`. To operate it:

```bash
sudo $EDITOR /etc/default/nginx-logtail   # set COLLECTOR_LOGS / COLLECTOR_LOGTAIL_PORT
sudo systemctl enable --now nginx-logtail-collector.service
sudo systemctl status nginx-logtail-collector.service
sudo journalctl -u nginx-logtail-collector.service -f
```

If you run from source without the package, compose a unit from the packaged template at
`debian/nginx-logtail-collector.service`.

---
## Aggregator

Runs on a central machine. Subscribes to the `StreamSnapshots` push stream from every configured
collector, merges their snapshots into a unified in-memory cache, and serves the same gRPC
interface as the collector. The frontend and CLI query the aggregator exactly as they would query
a single collector.

### Flags

| Flag | Default | Description |
|----------------|-----------|--------------------------------------------------------|
| `--listen` | `:9091` | gRPC listen address |
| `--collectors` | — | Comma-separated `host:port` addresses of collectors |
| `--source` | hostname | Name for this aggregator in query responses |

`--collectors` is required; the aggregator exits immediately if it is not set.
### Example

```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source agg-prod
```

### Fault tolerance

The aggregator reconnects to each collector independently with exponential backoff (100 ms →
doubles → cap 30 s). After 3 consecutive failures to a collector it marks that collector
**degraded**: its last-known contribution is subtracted from the merged view so stale counts
do not accumulate. When the collector recovers and sends a new snapshot, it is automatically
reintegrated. The remaining collectors continue serving queries throughout.
### Memory

The aggregator's merged cache uses the same tiered ring-buffer structure as the collector
(60 × 1-min fine, 288 × 5-min coarse) but holds at most top-50 000 entries per fine bucket
and top-5 000 per coarse bucket across all collectors combined. Memory footprint is roughly
the same as one collector (~845 MB worst case).

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail aggregator
After=network.target

[Service]
ExecStart=/usr/local/bin/aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

---
## Frontend

HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).
Zero JavaScript — server-rendered HTML with inline SVG sparklines.

### Flags

| Flag | Default | Description |
|-------------|-------------------|--------------------------------------------------|
| `--listen` | `:8080` | HTTP listen address |
| `--target` | `localhost:9091` | Default gRPC endpoint (aggregator or collector) |
| `--n` | `25` | Default number of table rows |
| `--refresh` | `30` | Auto-refresh interval in seconds; `0` to disable |
### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.

**Window tabs** — switch between `1m / 5m / 15m / 60m / 6h / 24h`. Only the window changes;
all active filters are preserved.

**Dimension tabs** — switch between grouping by `website / asn / prefix / status / uri / source`.

**Drilldown** — click any table row to add that value as a filter and advance to the next
dimension in the hierarchy:

```
website → client prefix → request URI → HTTP status → ASN → source_tag → website (cycles)
```

Example: click `example.com` in the website view to see which client prefixes are hitting it;
click a prefix there to see which URIs it is requesting; and so on.

**Breadcrumb strip** — shows all active filters above the table. Click `×` next to any token
to remove just that filter, keeping the others.

**Sparkline** — inline SVG trend chart showing total request count per time bucket for the
current filter state. Useful for spotting sudden spikes or sustained DDoS ramps.

**Filter expression box** — a text input above the table accepts a mini filter language that
lets you type expressions directly without editing the URL:
```
status>=400
status>=400 AND website~=gouda.*
status>=400 AND website~=gouda.* AND uri~="^/api/"
website=example.com AND prefix=1.2.3.0/24
```

Supported fields and operators:

| Field | Operators | Example |
|-----------|---------------------|-----------------------------------|
| `status` | `=` `!=` `>` `>=` `<` `<=` | `status>=400` |
| `website` | `=` `~=` | `website~=gouda.*` |
| `uri` | `=` `~=` | `uri~=^/api/` |
| `prefix` | `=` | `prefix=1.2.3.0/24` |
| `is_tor` | `=` `!=` | `is_tor=1`, `is_tor!=0` |
| `asn` | `=` `!=` `>` `>=` `<` `<=` | `asn=8298`, `asn>=1000` |
| `source_tag` | `=` | `source_tag=direct`, `source_tag=cdn` |

`is_tor=1` and `is_tor!=0` are equivalent (TOR traffic only). `is_tor=0` and `is_tor!=1` are
equivalent (non-TOR traffic only).

`asn` accepts the same comparison expressions as `status`. Use `asn=8298` to match a single AS,
`asn>=64512` to match the private-use ASN range, or `asn!=0` to exclude unresolved entries.

`~=` means RE2 regex match. Values with spaces or quotes may be wrapped in double or single
quotes: `uri~="^/search\?q="`.

The box pre-fills with the current active filter (including filters set by drilldown clicks),
so you can see and extend what is applied. Submitting redirects to a clean URL with the
individual filter params; `× clear` removes all filters at once.

On a parse error the page re-renders with the error shown below the input and the current
data and filters unchanged.
**Status expressions** — the `f_status` URL param (and `status` in the expression box) accepts
comparison expressions: `200`, `!=200`, `>=400`, `<500`, etc.

**Regex filters** — `f_website_re` and `f_uri_re` URL params (and `~=` in the expression box)
accept RE2 regular expressions. The breadcrumb strip shows them as `website~=gouda.*` and
`uri~=^/api/` with the usual `×` remove link.

**URL sharing** — all filter state is in the URL query string (`w`, `by`, `f_website`,
`f_prefix`, `f_uri`, `f_status`, `f_website_re`, `f_uri_re`, `f_is_tor`, `f_asn`,
`f_source_tag`, `n`). Copy the URL to share an exact view with another operator, or bookmark
a recurring query.

**JSON output** — append `&raw=1` to any URL to receive the TopN result as JSON instead of
HTML. Useful for scripting without the CLI binary:

```bash
# All 429s by prefix
curl -s 'http://frontend:8080/?f_status=429&by=prefix&w=1m&raw=1' | jq '.entries[0]'

# All errors (>=400) on gouda hosts
curl -s 'http://frontend:8080/?f_status=%3E%3D400&f_website_re=gouda.*&by=uri&w=5m&raw=1'
```

**Target override** — append `?target=host:port` to point the frontend at a different gRPC
endpoint for that request (useful for comparing a single collector against the aggregator):

```bash
http://frontend:8080/?target=nginx3:9090&w=5m
```

**Source picker** — when the frontend is pointed at an aggregator, a `source:` tab row appears
below the dimension tabs listing each individual collector alongside an **all** tab (the default
merged view). Clicking a collector tab switches the frontend to query that collector directly for
the current request, letting you answer "which nginx machine is the busiest?" without leaving the
dashboard. The picker is hidden when querying a collector directly (it has no sub-sources to list).

---
## CLI

A shell companion for one-off queries and debugging. Works with any `LogtailService` endpoint —
collector or aggregator. Accepts multiple targets, fans out concurrently, and labels each result.
Default output is a human-readable table; add `--json` for machine-readable NDJSON.

### Subcommands

```
logtail-cli topn    [flags]   ranked label → count table
logtail-cli trend   [flags]   per-bucket time series
logtail-cli stream  [flags]   live snapshot feed (runs until Ctrl-C)
logtail-cli targets [flags]   list targets known to the queried endpoint
```

### Shared flags (all subcommands)

| Flag | Default | Description |
|---------------|------------------|----------------------------------------------------------|
| `--target` | `localhost:9090` | Comma-separated `host:port` list; queries fan out to all |
| `--json` | false | Emit newline-delimited JSON instead of a table |
| `--website` | — | Filter to this website |
| `--prefix` | — | Filter to this client prefix |
| `--uri` | — | Filter to this request URI |
| `--status` | — | Filter: HTTP status expression (`200`, `!=200`, `>=400`, `<500`, …) |
| `--website-re`| — | Filter: RE2 regex against website |
| `--uri-re` | — | Filter: RE2 regex against request URI |
| `--is-tor` | — | Filter: `1` or `!=0` = TOR only; `0` or `!=1` = non-TOR only |
| `--asn` | — | Filter: ASN expression (`12345`, `!=65000`, `>=1000`, `<64512`, …) |
| `--source-tag`| — | Filter: exact `ipng_source_tag` (e.g. `direct`, `cdn`) |
### `topn` flags

| Flag | Default | Description |
|---------------|------------|-----------------------------------------------------------------------|
| `--n` | `10` | Number of entries |
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
| `--group-by` | `website` | `website` `prefix` `uri` `status` `asn` `source_tag` |

### `trend` flags

| Flag | Default | Description |
|---------------|------------|----------------------------------------------------------|
| `--window` | `5m` | `1m` `5m` `15m` `60m` `6h` `24h` |
### Output format

**Table** (default; a single target gets no section header):
```
RANK   COUNT   LABEL
1      18 432  example.com
2       4 211  other.com
```

**Multi-target** — each target gets a labeled section:
```
=== col-1 (nginx1:9090) ===
RANK   COUNT   LABEL
1      10 000  example.com

=== agg-prod (agg:9091) ===
RANK   COUNT   LABEL
1      18 432  example.com
```

**JSON** (`--json`) — a single JSON array with one object per target, suitable for `jq`:
```json
[{"source":"agg-prod","target":"agg:9091","entries":[{"label":"example.com","count":18432},...]}]
```

**`stream` JSON** — one object per snapshot received (NDJSON), runs until interrupted:
```json
{"ts":1773516180,"source":"col-1","target":"nginx1:9090","total_entries":823,"top_label":"example.com","top_count":10000}
```
### `targets` subcommand

Lists the targets (collectors) known to the queried endpoint. When querying an aggregator, returns
all configured collectors with their display names and addresses. When querying a collector,
returns the collector itself (address shown as `(self)`).

```bash
# List collectors behind the aggregator
logtail-cli targets --target agg:9091

# Machine-readable output
logtail-cli targets --target agg:9091 --json
```

Table output example:
```
nginx1.prod   nginx1:9090
nginx2.prod   nginx2:9090
nginx3.prod   (self)
```

JSON output (`--json`) — one object per target:
```json
{"query_target":"agg:9091","name":"nginx1.prod","addr":"nginx1:9090"}
```
### Examples
|
||
|
||
```bash
# Top 20 client prefixes sending 429s right now
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20

# Same query, pipe to jq for scripting
logtail-cli topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 \
  --json | jq '.[0].entries[0]'

# Which website has the most errors (4xx or 5xx) over the last 24h?
logtail-cli topn --target agg:9091 --window 24h --group-by website --status '>=400'

# Which client prefixes are NOT getting 200s? (anything non-success)
logtail-cli topn --target agg:9091 --window 5m --group-by prefix --status '!=200'

# Drill down: top URIs on one website over the last 60 minutes
logtail-cli topn --target agg:9091 --window 60m --group-by uri --website api.example.com

# Filter by website regex: all gouda hosts
logtail-cli topn --target agg:9091 --window 5m --website-re 'gouda.*'

# Filter by URI regex: all /api/ paths
logtail-cli topn --target agg:9091 --window 5m --group-by uri --uri-re '^/api/'

# Show only TOR traffic — which websites are TOR clients hitting?
logtail-cli topn --target agg:9091 --window 5m --is-tor 1

# Show non-TOR traffic only — exclude exit nodes from the view
logtail-cli topn --target agg:9091 --window 5m --is-tor 0

# Top ASNs by request count over the last 5 minutes
logtail-cli topn --target agg:9091 --window 5m --group-by asn

# Which ASNs are generating the most 429s?
logtail-cli topn --target agg:9091 --window 5m --group-by asn --status 429

# Filter to traffic from a specific ASN
logtail-cli topn --target agg:9091 --window 5m --asn 8298

# Filter to traffic from private-use / unallocated ASNs
logtail-cli topn --target agg:9091 --window 5m --group-by prefix --asn '>=64512'

# Exclude unresolved entries (ASN 0) and show top source ASNs
logtail-cli topn --target agg:9091 --window 5m --group-by asn --asn '!=0'

# Compare two collectors side by side in one command
logtail-cli topn --target nginx1:9090,nginx2:9090 --window 5m

# Query both a collector and the aggregator at once
logtail-cli topn --target nginx3:9090,agg:9091 --window 5m --group-by prefix

# Trend of total traffic over 6h (for a quick sparkline in the terminal)
logtail-cli trend --target agg:9091 --window 6h --json | jq '.[0].points | [.[].count]'

# Watch live merged snapshots from the aggregator
logtail-cli stream --target agg:9091

# Watch two collectors simultaneously; each snapshot is labeled by source
logtail-cli stream --target nginx1:9090,nginx2:9090
```
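The `--json` array shape shown earlier lends itself to quick post-processing beyond `jq`. A sketch
computing each label's share of its target's visible total (the function name is illustrative;
field names match the documented output):

```python
import json


def label_shares(topn_json):
    """Given the JSON array printed by `topn --json` (one object per
    target, each with an `entries` list of {label, count}), return
    per-target label shares as fractions of that target's visible total."""
    out = {}
    for target in json.loads(topn_json):
        total = sum(e["count"] for e in target["entries"]) or 1
        out[target["source"]] = {
            e["label"]: e["count"] / total for e in target["entries"]
        }
    return out
```

Note the total covers only the returned top-N entries, not all traffic, so shares are relative to
what the query surfaced.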

The `stream` subcommand reconnects automatically after errors (5 s backoff) and runs until
interrupted with Ctrl-C. The `topn` and `trend` subcommands exit immediately after one response.

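The fixed-backoff reconnect loop can be sketched generically — this illustrates the pattern, not
the CLI's actual code, and the function name is hypothetical:

```python
import time


def run_with_reconnect(connect, backoff_s=5.0, max_attempts=None):
    """Keep re-invoking `connect` (which blocks while the stream is healthy),
    sleeping a fixed backoff after each failure. Returns when `connect`
    returns cleanly; with max_attempts=None it retries forever."""
    attempts = 0
    while True:
        try:
            connect()
            return  # clean end of stream
        except Exception:
            attempts += 1
            if max_attempts is not None and attempts >= max_attempts:
                raise
            time.sleep(backoff_s)
```

A fixed backoff (rather than exponential) keeps reconnect latency predictable for an interactive
watch tool.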
---
## Operational notes

**No persistence.** All data is in-memory. A collector restart loses ring buffer history but
resumes tailing the log file from the current position immediately.

**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a
TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.

**inotify limits.** The collector uses a single inotify instance regardless of how many files it
tails. If you tail files across many different directories, check
`/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:
```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```

**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with
unique URIs, the live map will hit its 100 000 entry cap and drop new keys for the rest of that
minute. The top-K entries already tracked continue accumulating counts. This is by design — the
cap prevents memory exhaustion under attack conditions.

**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have
significant clock skew, trend buckets from different collectors may not align precisely in the
aggregator. NTP sync is recommended.