# nginx-logtail User Guide

## Overview

nginx-logtail is a four-component system for real-time traffic analysis across a cluster of nginx
machines. It answers questions like:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Components:

| Binary        | Runs on          | Role                                               |
|---------------|------------------|----------------------------------------------------|
| `collector`   | each nginx host  | Tails log files, aggregates in memory, serves gRPC |
| `aggregator`  | central host     | Merges all collectors, serves unified gRPC         |
| `frontend`    | central host     | HTTP dashboard with drilldown UI                   |
| `cli`         | operator laptop  | Shell queries against collector or aggregator      |

---

## nginx Configuration

Add the `logtail` log format to your `nginx.conf` and apply it to each `server` block:

```nginx
http {
    log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time';

    server {
        access_log /var/log/nginx/access.log logtail;
        # or per-vhost:
        access_log /var/log/nginx/www.example.com.access.log logtail;
    }
}
```

The format is tab-separated with fixed field positions. Query strings are stripped from the URI
by the collector at ingest time — only the path is tracked.
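
As a sketch of that ingest step (not the collector's actual code — just the documented behavior, with field order taken from the `log_format` above):

```go
package main

import (
	"fmt"
	"strings"
)

// Field order from the logtail log_format above:
// host, remote_addr, msec, method, uri, status, bytes, request_time.
const numFields = 8

// stripQuery drops everything from the first '?' onward; per the guide,
// only the path is tracked.
func stripQuery(uri string) string {
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		return uri[:i]
	}
	return uri
}

// parseLine splits one logtail line on tabs and normalizes the URI field.
// Lines with the wrong field count are rejected.
func parseLine(line string) ([]string, bool) {
	fields := strings.Split(line, "\t")
	if len(fields) != numFields {
		return nil, false
	}
	fields[4] = stripQuery(fields[4])
	return fields, true
}

func main() {
	line := "www.example.com\t192.0.2.7\t1700000000.123\tGET\t/search?q=hi\t200\t512\t0.034"
	if fields, ok := parseLine(line); ok {
		fmt.Println(fields[0], fields[4]) // → www.example.com /search
	}
}
```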
---

## Building

```bash
git clone https://git.ipng.ch/ipng/nginx-logtail
cd nginx-logtail
go build ./cmd/collector/
go build ./cmd/aggregator/
go build ./cmd/frontend/
go build ./cmd/cli/
```

Requires Go 1.21+. No CGO, no external runtime dependencies.

---

## Collector

Runs on each nginx machine. Tails log files, maintains in-memory top-K counters across six time
windows, and exposes a gRPC interface for the aggregator (and directly for the CLI).

### Flags

| Flag           | Default      | Description                                                |
|----------------|--------------|------------------------------------------------------------|
| `--listen`     | `:9090`      | gRPC listen address                                        |
| `--logs`       | —            | Comma-separated log file paths or glob patterns            |
| `--logs-file`  | —            | File containing one log path/glob per line                 |
| `--source`     | hostname     | Name for this collector in query responses                 |
| `--v4prefix`   | `24`         | IPv4 prefix length for client bucketing (lower it, e.g. to `23`, for coarser buckets) |
| `--v6prefix`   | `48`         | IPv6 prefix length for client bucketing                    |

At least one of `--logs` or `--logs-file` is required.

### Examples

```bash
# Single file
./collector --logs /var/log/nginx/access.log

# Multiple files via glob (one inotify instance regardless of count)
./collector --logs "/var/log/nginx/*/access.log"

# Many files via a config file
./collector --logs-file /etc/nginx-logtail/logs.conf

# Custom prefix lengths and listen address
./collector \
    --logs "/var/log/nginx/*.log" \
    --listen :9091 \
    --source nginx3.prod \
    --v4prefix 24 \
    --v6prefix 48
```

### logs-file format

One path or glob pattern per line. Lines starting with `#` are ignored.

```
# /etc/nginx-logtail/logs.conf
/var/log/nginx/access.log
/var/log/nginx/*/access.log
/var/log/nginx/api.example.com.access.log
```

### Log rotation

The collector handles logrotate automatically. On `RENAME`/`REMOVE` events it drains the old file
descriptor to EOF (so no lines are lost), then retries opening the original path with backoff until
the new file appears. No restart or SIGHUP required.

### Memory usage

The collector is designed to stay well under 1 GB:

| Structure                   | Max entries | Approx size |
|-----------------------------|-------------|-------------|
| Live map (current minute)   | 100 000     | ~19 MB      |
| Fine ring (60 × 1-min)      | 60 × 50 000 | ~558 MB     |
| Coarse ring (288 × 5-min)   | 288 × 5 000 | ~268 MB     |
| **Total**                   |             | **~845 MB** |

When the live map reaches 100 000 distinct 4-tuples, new keys are dropped for the rest of that
minute. Existing keys continue to accumulate counts. The cap resets at each minute rotation.
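
The cap logic described above can be sketched as follows (the `key` fields are illustrative stand-ins for the guide's 4-tuple, not the real types):

```go
package main

import "fmt"

// key is a stand-in for the 4-tuple the collector counts:
// website, client prefix, URI, status.
type key struct {
	website, prefix, uri string
	status               int
}

const maxLiveEntries = 100000 // per-minute cap from the table above

// liveMap admits new keys only while under the cap; keys already present
// always keep counting, matching the documented behavior.
type liveMap struct {
	counts map[key]uint64
}

func newLiveMap() *liveMap {
	return &liveMap{counts: make(map[key]uint64)}
}

// add increments k, creating it only if the cap has not been reached.
// It reports whether the event was recorded.
func (m *liveMap) add(k key) bool {
	if _, ok := m.counts[k]; !ok && len(m.counts) >= maxLiveEntries {
		return false // new key dropped for the rest of this minute
	}
	m.counts[k]++
	return true
}

// rotate hands off the finished minute and starts fresh, resetting the cap.
func (m *liveMap) rotate() map[key]uint64 {
	old := m.counts
	m.counts = make(map[key]uint64)
	return old
}

func main() {
	m := newLiveMap()
	k := key{"www.example.com", "192.0.2.0/24", "/", 200}
	m.add(k)
	m.add(k)
	fmt.Println(m.counts[k]) // → 2
}
```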

### Time windows

Data is served from two tiered ring buffers:

| Window | Source ring | Resolution |
|--------|-------------|------------|
| 1 min  | fine        | 1 minute   |
| 5 min  | fine        | 1 minute   |
| 15 min | fine        | 1 minute   |
| 60 min | fine        | 1 minute   |
| 6 h    | coarse      | 5 minutes  |
| 24 h   | coarse      | 5 minutes  |

History is lost on restart — the collector resumes tailing immediately but all ring buffers start
empty. The fine ring fills in 1 hour; the coarse ring fills in 24 hours.
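
A sketch of how a query window maps onto a ring and a bucket count, mirroring the table above (the bucket counts are derived from window ÷ resolution; the names are illustrative, not the real API):

```go
package main

import (
	"fmt"
	"time"
)

// windowSpec maps a query window to its source ring and bucket size.
type windowSpec struct {
	ring       string        // "fine" (1-min buckets) or "coarse" (5-min buckets)
	resolution time.Duration // bucket width
	buckets    int           // ring slots the window spans
}

// specFor resolves a window name: the fine ring serves everything up to
// 60m; the coarse ring serves 6h and 24h.
func specFor(window string) (windowSpec, bool) {
	switch window {
	case "1m":
		return windowSpec{"fine", time.Minute, 1}, true
	case "5m":
		return windowSpec{"fine", time.Minute, 5}, true
	case "15m":
		return windowSpec{"fine", time.Minute, 15}, true
	case "60m":
		return windowSpec{"fine", time.Minute, 60}, true
	case "6h":
		return windowSpec{"coarse", 5 * time.Minute, 72}, true
	case "24h":
		return windowSpec{"coarse", 5 * time.Minute, 288}, true
	}
	return windowSpec{}, false
}

func main() {
	s, _ := specFor("6h")
	fmt.Println(s.ring, s.buckets) // → coarse 72
}
```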

### Systemd unit example

```ini
[Unit]
Description=nginx-logtail collector
After=network.target

[Service]
ExecStart=/usr/local/bin/collector \
    --logs-file /etc/nginx-logtail/logs.conf \
    --listen :9090 \
    --source %H
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
---

## Aggregator

Runs on a central machine. Connects to all collectors via gRPC streaming, merges their snapshots
into a unified view, and serves the same gRPC interface as the collector.

### Flags

| Flag           | Default   | Description                                            |
|----------------|-----------|--------------------------------------------------------|
| `--listen`     | `:9091`   | gRPC listen address                                    |
| `--collectors` | —         | Comma-separated `host:port` addresses of collectors    |
| `--source`     | hostname  | Name for this aggregator in query responses            |

### Example

```bash
./aggregator \
    --collectors nginx1:9090,nginx2:9090,nginx3:9090 \
    --listen :9091
```

The aggregator tolerates collector failures — if one collector is unreachable, results from the
remaining collectors are returned with a warning. It reconnects automatically with backoff.
---

## Frontend

HTTP dashboard. Connects to the aggregator (or directly to a single collector for debugging).

### Flags

| Flag        | Default          | Description                              |
|-------------|------------------|------------------------------------------|
| `--listen`  | `:8080`          | HTTP listen address                      |
| `--target`  | `localhost:9091` | gRPC address of aggregator or collector  |

### Usage

Navigate to `http://your-host:8080`. The dashboard shows a ranked table of the top entries for
the selected dimension and time window.

**Filter controls:**
- Click any row to add that value as a filter (e.g. click a website to restrict to it)
- The filter breadcrumb at the top shows all active filters; click any token to remove it
- Use the window tabs to switch between 1m / 5m / 15m / 60m / 6h / 24h
- The page auto-refreshes every 30 seconds

**Dimension selector:** switch between grouping by Website, Client Prefix, Request URI, or HTTP
Status using the tabs at the top of the table.

**Sparkline:** the trend chart shows total request count per bucket for the selected window and
active filters. Useful for spotting sudden spikes.

**URL sharing:** all filter state is in the URL query string — copy the URL to share a specific
view with another operator.
---

## CLI

A shell companion for one-off queries and debugging. Outputs JSON; pipe to `jq` for filtering.

### Subcommands

```
cli topn   --target HOST:PORT [filters] [--by DIM] [--window W] [--n N] [--pretty]
cli trend  --target HOST:PORT [filters] [--window W] [--pretty]
cli stream --target HOST:PORT [--pretty]
```

### Common flags

| Flag          | Default          | Description                                              |
|---------------|------------------|----------------------------------------------------------|
| `--target`    | `localhost:9090` | gRPC address of collector or aggregator                  |
| `--by`        | `website`        | Dimension: `website` `prefix` `uri` `status`             |
| `--window`    | `5m`             | Window: `1m` `5m` `15m` `60m` `6h` `24h`                 |
| `--n`         | `10`             | Number of results                                        |
| `--website`   | —                | Filter to this website                                   |
| `--prefix`    | —                | Filter to this client prefix                             |
| `--uri`       | —                | Filter to this request URI                               |
| `--status`    | —                | Filter to this HTTP status code                          |
| `--pretty`    | false            | Pretty-print JSON                                        |

### Examples

```bash
# Top 20 client prefixes sending 429s right now
cli topn --target agg:9091 --window 1m --by prefix --status 429 --n 20 | jq '.entries[]'

# Which website has the most 503s in the last 24h?
cli topn --target agg:9091 --window 24h --by website --status 503

# Trend of 429s on one site over 6h — pipe to a quick graph
cli trend --target agg:9091 --window 6h --website api.example.com \
    | jq '[.points[] | {t: .time, n: .count}]'

# Watch live snapshots from one collector; alert on large entry counts
cli stream --target nginx3:9090 | jq -c 'select(.entry_count > 50000)'

# Query a single collector directly (bypass aggregator)
cli topn --target nginx1:9090 --window 5m --by prefix --pretty
```

The `stream` subcommand emits one JSON object per line (NDJSON) and runs until interrupted.
Exit code is non-zero on any gRPC error.
---

## Operational notes

**No persistence.** All data is in-memory. A collector restart loses ring buffer history but
resumes tailing the log file from the current position immediately.

**No TLS.** Designed for trusted internal networks. If you need encryption in transit, put a
TLS-terminating proxy (e.g. stunnel, nginx stream) in front of the gRPC port.

**inotify limits.** The collector uses a single inotify instance regardless of how many files it
tails. If you tail files across many different directories, check
`/proc/sys/fs/inotify/max_user_watches` (default 8192); increase it if needed:

```bash
echo 65536 | sudo tee /proc/sys/fs/inotify/max_user_watches
```

**High-cardinality attacks.** If a DDoS sends traffic from thousands of unique /24 prefixes with
unique URIs, the live map will hit its 100 000 entry cap and drop new keys for the rest of that
minute. The top-K entries already tracked continue accumulating counts. This is by design — the
cap prevents memory exhaustion under attack conditions.

**Clock skew.** Trend sparklines are based on the collector's local clock. If collectors have
significant clock skew, trend buckets from different collectors may not align precisely in the
aggregator. NTP sync is recommended.
|