# nginx-logtail User Guide

## Overview

nginx-logtail is a system of three server components plus a CLI companion for real-time traffic analysis across a cluster of nginx machines. It answers questions like:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Components:

| Binary        | Runs on          | Role                                               |
|---------------|------------------|----------------------------------------------------|
| `collector`   | each nginx host  | Tails log files, aggregates in memory, serves gRPC |
| `aggregator`  | central host     | Merges all collectors, serves unified gRPC         |
| `frontend`    | central host     | HTTP dashboard with drilldown UI                   |
| `cli`         | operator laptop  | Shell queries against collector or aggregator      |

---

## nginx Configuration

Add the `logtail` log format to your `nginx.conf` and apply it to each `server` block:

```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```

The format is tab-separated with fixed field positions. Query strings are stripped from the URI by the collector at ingest time; only the path is tracked.
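As a sketch of what ingest looks like (the names `Entry` and `parseLine` are illustrative, not the collector's actual API), splitting one logtail line on tabs and dropping the query string might be:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Entry holds the fields of interest from one logtail line.
// Hypothetical type for illustration; the real collector may differ.
type Entry struct {
	Host   string // $host
	Client string // $remote_addr
	Method string // $request_method
	Path   string // $request_uri with the query string stripped
	Status int    // $status
	Bytes  int64  // $body_bytes_sent
}

// parseLine splits one tab-separated logtail line into its eight fields
// and strips the query string from the URI, keeping only the path.
func parseLine(line string) (Entry, error) {
	f := strings.Split(line, "\t")
	if len(f) != 8 {
		return Entry{}, fmt.Errorf("expected 8 fields, got %d", len(f))
	}
	status, err := strconv.Atoi(f[5])
	if err != nil {
		return Entry{}, fmt.Errorf("bad status %q: %w", f[5], err)
	}
	bytes, err := strconv.ParseInt(f[6], 10, 64)
	if err != nil {
		return Entry{}, fmt.Errorf("bad bytes %q: %w", f[6], err)
	}
	path := f[4]
	if i := strings.IndexByte(path, '?'); i >= 0 {
		path = path[:i] // only the path is tracked
	}
	return Entry{Host: f[0], Client: f[1], Method: f[3], Path: path, Status: status, Bytes: bytes}, nil
}

func main() {
	e, err := parseLine("www.example.com\t192.0.2.7\t1700000000.123\tGET\t/search?q=x\t429\t512\t0.004")
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Path, e.Status) // /search 429
}
```

Because field positions are fixed, a line with the wrong field count can simply be rejected rather than guessed at.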
---

## Building

```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```

Requires Go 1.21+. No CGO, no external runtime dependencies.

---

## Collector

Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).

### Flags

| Flag           | Default      | Description                                                |
|----------------|--------------|------------------------------------------------------------|
| `--listen`     | `:9090`      | gRPC listen address                                        |
| `--logs`       | —            | Comma-separated log file paths or glob patterns            |
| `--logs-file`  | —            | File containing one log path/glob per line                 |
| `--source`     | hostname     | Name for this collector in query responses                 |
| `--v4prefix`   | `24`         | IPv4 prefix length for client bucketing (e.g. `24` groups clients per /24) |
| `--v6prefix`   | `48`         | IPv6 prefix length for client bucketing                    |

At least one of `--logs` or `--logs-file` is required.
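The bucketing itself is a straightforward mask-and-print. A minimal sketch using the standard library's `net/netip` (the function name `bucketKey` is hypothetical, not the collector's actual internals):

```go
package main

import (
	"fmt"
	"net/netip"
)

// bucketKey maps a client address onto its aggregation prefix using the
// configured --v4prefix / --v6prefix lengths. Unparseable input is
// returned unchanged so one malformed log line cannot break ingest.
func bucketKey(client string, v4bits, v6bits int) string {
	addr, err := netip.ParseAddr(client)
	if err != nil {
		return client
	}
	bits := v4bits
	if addr.Is6() && !addr.Is4In6() {
		bits = v6bits
	}
	// Addr.Prefix masks the host bits, e.g. 192.0.2.77 -> 192.0.2.0/24.
	p, err := addr.Prefix(bits)
	if err != nil {
		return client
	}
	return p.String()
}

func main() {
	fmt.Println(bucketKey("192.0.2.77", 24, 48))  // 192.0.2.0/24
	fmt.Println(bucketKey("2001:db8::1", 24, 48)) // 2001:db8::/48
}
```

Grouping clients by prefix rather than exact address keeps cardinality manageable and makes botnets that spread across a subnet show up as one row.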
### Examples

```bash
# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```

### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```
### Log rotation

The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until the new file appears. No restart or SIGHUP required.
### Memory usage

The collector is designed to stay well under 1 GB:

| Structure                   | Max entries | Approx size |
|-----------------------------|-------------|-------------|
| Live map (current minute)   | 100 000     | ~19 MB      |
| Fine ring (60 × 1-min)      | 60 × 50 000 | ~558 MB     |
| Coarse ring (288 × 5-min)   | 288 × 5 000 | ~268 MB     |
| **Total**                   |             | **~845 MB** |

When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
### Time windows

Data is served from two tiered ring buffers:

| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min  | fine        | 1 minute   |
| 5 min  | fine        | 1 minute   |
| 15 min | fine        | 1 minute   |
| 60 min | fine        | 1 minute   |
| 6 h    | coarse      | 5 minutes  |
| 24 h   | coarse      | 5 minutes  |

History is lost on restart: the collector resumes tailing immediately, but all ring buffers start empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail collector
After=network.target

[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
---

## Aggregator

Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots into a unified view, and serves the same gRPC interface as the collector.

### Flags

| Flag           | Default   | Description                                         |
|----------------|-----------|-----------------------------------------------------|
| `--listen`     | `:9091`   | gRPC listen address                                 |
| `--collectors` | —         | Comma-separated `host:port` addresses of collectors |
| `--source`     | hostname  | Name for this aggregator in query responses         |

### Example

```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091
```

The aggregator tolerates collector failures: if one collector is unreachable, results from the remaining collectors are returned with a warning. It reconnects automatically with backoff.
---

## Frontend

HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).

### Flags

| Flag        | Default          | Description                              |
|-------------|------------------|------------------------------------------|
| `--listen`  | `:8080`          | HTTP listen address                      |
| `--target`  | `localhost:9091` | gRPC address of aggregator or collector  |

### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for the selected dimension and time window.

**Filter controls:**
- Click any row to add that value as a filter (e.g. click a website to restrict to it)
- The filter breadcrumb at the top shows all active filters; click any token to remove it
- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
- The page auto-refreshes every 30 seconds

**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP Status using the tabs at the top of the table.

**Sparkline:** the trend chart shows total request count per bucket for the selected window and active filters. Useful for spotting sudden spikes.

**URL sharing:** all filter state is in the URL query string; copy the URL to share a specific view with another operator.
---

## CLI

A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.

### Subcommands

```
cli topn   --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
cli trend  --target HOST:PORT [filters] [--window W] [--pretty]
cli stream --target HOST:PORT [--pretty]
```

### Common flags

| Flag          | Default          | Description                                              |
|---------------|------------------|----------------------------------------------------------|
| `--target`    | `localhost:9090` | gRPC address of collector or aggregator                  |
| `--by`        | `website`        | Dimension: `website` `prefix` `uri` `status`             |
| `--window`    | `5m`             | Window: `1m` `5m` `15m` `60m` `6h` `24h`                 |
| `--n`         | `10`             | Number of results                                        |
| `--website`   | —                | Filter to this website                                   |
| `--prefix`    | —                | Filter to this client prefix                             |
| `--uri`       | —                | Filter to this request URI                               |
| `--status`    | —                | Filter to this HTTP status code                          |
| `--pretty`    | false            | Pretty-print JSON                                        |

### Examples

```bash
# Top 20 client prefixes sending 429s right now
cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'

# Which website has the most 503s in the last 24h?
cli topn --target agg:9091 --window 24h --by website --status 503

# Trend of 429s on one site over 6h — pipe to a quick graph
cli trend --target agg:9091 --window 6h --website api.example.com \
  | jq '[.points[] | {t: .time, n: .count}]'

# Watch live snapshots from one collector; alert on large entry counts
cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'

# Query a single collector directly (bypass aggregator)
cli topn --target nginx1:9090 --window 5m --by prefix --pretty
```

The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted. Exit code is non-zero on any gRPC error.
---

## Operational notes

**No persistence.** All data is in-memory. A collector restart loses ring buffer history but resumes tailing the log file from the current position immediately.

**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.

**inotify limits.** The collector uses a single inotify instance regardless of how many files it tails. If you tail files across many different directories, check `/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:

```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```

**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with unique URIs, the live map will hit its 100 000 entry cap and drop new keys for the rest of that minute. The top-K entries already tracked continue accumulating counts. This is by design: the cap prevents memory exhaustion under attack conditions.

**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have significant clock skew, trend buckets from different collectors may not align precisely in the aggregator. NTP sync is recommended.