nginx-logtail/docs/design.md
Pim van Pelt 143aad9063 PRE-RELEASE 0.9.1: Makefile, Debian packaging, versioned UDP
Build and release tooling:
- Makefile with help as default; targets: build/build-amd64/build-arm64,
  test, lint, proto, pkg-deb, docker, docker-push, clean, plus
  install-deps (+ three sub-targets for apt / Go toolchain / Go tools).
- internal/version package; -ldflags -X injects Version/Commit/Date into
  every binary. -version flag on all four binaries (nginx-logtail version
  for the CLI).
- Dockerfile takes VERSION/COMMIT/DATE build-args and forwards them.
- .deb output lands in build/; .gitignore ignores /build/.

Debian package:
- debian/build-deb.sh packages all four static binaries into a single
  nginx-logtail_<ver>_<arch>.deb using dpkg-deb.
- Binary layout: /usr/sbin/nginx-logtail-{collector,aggregator,frontend}
  and /usr/bin/nginx-logtail.
- nginx-logtail(8) manpage.
- Three systemd units (collector, aggregator, frontend) shipped under
  /lib/systemd/system/. Installed but never enabled or started — the
  operator opts in per host.
- Collector runs as _logtail:www-data (log access); aggregator and
  frontend as _logtail:_logtail. postinst creates the system user/group
  idempotently.
- Single shared env file /etc/default/nginx-logtail rendered from a
  template at first install with %HOSTNAME% substituted. Sensible
  defaults for every COLLECTOR_*, AGGREGATOR_*, FRONTEND_* variable;
  plus COLLECTOR_ARGS / AGGREGATOR_ARGS / FRONTEND_ARGS escape hatches
  appended to ExecStart. Not a dpkg conffile: operator edits survive
  upgrades and dpkg --purge removes it.

Versioned UDP wire format:
- ParseUDPLine dispatches on a leading "v<N>\t" tag; v1 routes to the
  existing 12-field parser. Unknown/missing versions fail closed so
  future v2 parsers can land before emitters are upgraded.
- Tests updated; design.md FR-2.2 rewritten to make the version tag
  normative.

Docs:
- README.md gains a Quick Start (Debian / Docker Compose / from source).
- user-guide.md rewritten around Installation and Configuration: full
  env-var table, UDP-only default explained, precise file/UDP log_format
  layouts, note that operators can emit "0" for unknown $is_tor / $asn.
- Drilldown cycle, frontend filter table, and CLI --group-by list all
  include source_tag. UDP counters documented in the Prometheus section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:35:08 +02:00

<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-logtail Design Document
## Metadata
| | |
| --- | --- |
| **Status** | Describes intended behavior as of `v0.2.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-17 |
| **Audience** | Operators and contributors running real-time traffic analysis and DDoS detection across a fleet of nginx hosts |
The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase — "can", "will", "does" — and should not be read as normative.
## Summary
`nginx-logtail` is a four-binary Go system for real-time analysis of nginx traffic across a fleet of hosts. Each nginx host runs a
**collector** that ingests logs (from files via `fsnotify`, from a UDP socket, or both) and maintains in-memory ranked top-K counters
across multiple time windows. A central **aggregator** subscribes to the collectors' snapshot streams and serves a merged view. An
**HTTP frontend** renders a drilldown dashboard (server-rendered HTML, zero JavaScript). A **CLI** offers the same queries as a
shell companion. All four programs speak a single gRPC service (`LogtailService`), so the frontend and CLI work against any collector
or the aggregator interchangeably.
## Background
Operators running tens of nginx hosts behind a load balancer need a live, drilldown view of request traffic for DDoS detection and
traffic analysis. Questions the system answers include:
- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?
Existing log-analysis pipelines (ELK, Loki, ClickHouse, etc.) answer questions like these but require infrastructure that is
disproportionate for the target workload. A handful of nginx hosts, each doing ~10 K req/s at peak, can be served from a per-minute
top-K structure in ~1 GB of RAM per host, with <250 ms query latency across the whole fleet and no storage tier at all.
A companion project, [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin), adds per-device attribution in nginx
itself and can emit a logtail-format access log as UDP datagrams. `nginx-logtail` was extended in `v0.2.0` to ingest that stream
natively, so operators can run it either from on-disk log files, from the UDP feed, or both on the same host.
## Goals and Non-Goals
### Product Goals
1. **Live top-K per (website, client_prefix, URI, status, is_tor, asn, source_tag).** For every combination of these dimensions the
system maintains an integer count, ranked so that the top entries are readily available across 1 m, 5 m, 15 m, 60 m, 6 h, and 24 h
windows.
2. **Sub-second query latency.** `TopN` and `Trend` queries MUST return from the collector and from the aggregator in well under one
second at the target scale (10 hosts, 10 K req/s each).
3. **Bounded memory.** The collector MUST stay within a 1 GB steady-state memory budget regardless of input cardinality, including
during high-cardinality DDoS attacks.
4. **Two ingest paths, one data model.** On-disk log files (`fsnotify`-tailed, logrotate-aware) and UDP datagrams (from
`nginx-ipng-stats-plugin`) MUST both feed the same in-memory structure, with a single log format per path and no operator-visible
difference downstream.
5. **No external storage, no TLS, no CGO.** The entire system runs as four static Go binaries on a trusted internal network. Operators
who need retention beyond the ring buffers SHOULD scrape Prometheus.
6. **One service contract.** Collectors and the aggregator implement the same gRPC `LogtailService`. Frontend and CLI MUST work
against either interchangeably, with the collector returning "itself" from `ListTargets` and the aggregator returning its configured
collector set.
### Non-Goals
- The system does **not** parse arbitrary nginx `log_format` strings. Two fixed tab-separated formats are supported: a file format and
a UDP format (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or Promtail.
- The system does **not** store raw log lines. Counts are aggregated at ingest; the original log lines are not kept in memory or on
disk. The project does not replace an access log.
- The system does **not** persist counters across restarts. Ring buffers are in-memory only. On aggregator restart, historical state
is reconstructed by calling `DumpSnapshots` on each collector (FR-4.3). On collector restart the rings start empty and refill as new
traffic arrives.
- The system does **not** provide per-URI request timing distributions. Latency histograms exist only in the collector's Prometheus
exposition (per host), not in the top-K data model.
- The system does **not** ship TLS or authentication for its gRPC endpoints. Operators who expose it beyond a trusted network are
expected to terminate TLS in a front proxy.
- The system is **not** a general-purpose metric store. The Prometheus exporter on the collector exposes a deliberately narrow set:
per-host request counter, per-host body-size and request-time histograms, and per-`source_tag` rollup counters.
## Requirements
Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.
### Functional Requirements
**FR-1 Counter data model**
- **FR-1.1** The canonical unit of counting MUST be a 7-tuple
`(website, client_prefix, http_request_uri, http_response, is_tor, asn, ipng_source_tag)` mapped to a 64-bit integer request count.
The data model contains no other fields: no timing, no byte counts, no method (those live only in the Prometheus exposition,
FR-8).
- **FR-1.2** `website` MUST be the nginx `$host` value.
- **FR-1.3** `client_prefix` MUST be the client IP truncated to a configurable prefix length, formatted as CIDR. Default `/24` for
IPv4 and `/48` for IPv6 (flags `-v4prefix`, `-v6prefix`). Truncation happens at ingest; the original address is not retained.
- **FR-1.4** `http_request_uri` MUST be the `$request_uri` path only — the query string (from the first `?` onward) MUST be stripped
at ingest. This is the dominant cardinality-reduction measure; DDoS traffic with attacker-generated query strings cannot grow the
working set.
- **FR-1.5** `http_response` MUST be the HTTP status code as recorded by nginx.
- **FR-1.6** `is_tor` MUST be a boolean, populated by the operator in the log format (typically via a lookup against a TOR exit-node
list). For the file format, lines without this field default to `false` for backward compatibility.
- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). For the file format, lines without
this field default to `0`.
- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. For records from
on-disk log files, the collector MUST assign the tag `"direct"` (mirroring `nginx-ipng-stats-plugin`'s default-source convention). For
records from the UDP stream, the tag is taken from the log line as emitted by the plugin.
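FR-1.3 and FR-1.4 carry most of the cardinality reduction. A minimal sketch of both transforms (hypothetical helper names; the collector's actual code may differ):

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// truncatePrefix maps a client address to its CIDR bucket: /24 for IPv4
// and /48 for IPv6 by default (FR-1.3). The original address is not kept.
func truncatePrefix(addr string, v4bits, v6bits int) (string, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return "", err
	}
	bits := v6bits
	if ip.Is4() {
		bits = v4bits
	}
	p, err := ip.Prefix(bits) // masks the host bits
	if err != nil {
		return "", err
	}
	return p.String(), nil
}

// stripQuery drops everything from the first '?' onward (FR-1.4), so
// attacker-generated query strings cannot grow the working set.
func stripQuery(uri string) string {
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		return uri[:i]
	}
	return uri
}

func main() {
	p, _ := truncatePrefix("192.0.2.77", 24, 48)
	fmt.Println(p) // 192.0.2.0/24
	p6, _ := truncatePrefix("2001:db8:aa:bb::1", 24, 48)
	fmt.Println(p6) // 2001:db8:aa::/48
	fmt.Println(stripQuery("/search?q=abc")) // /search
}
```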
**FR-2 Log formats**
- **FR-2.1 File format.** The collector MUST accept nginx access logs in the following tab-separated layout, with the last two fields
(`is_tor`, `asn`) optional for backward compatibility:
```nginx
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
```
| # | Field | Ingested into |
|---|-------------------|----------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated)|
| 2 | `$msec` | (discarded) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |
- **FR-2.2 UDP format.** The collector MUST accept datagrams in a versioned tab-separated layout, as emitted by
`nginx-ipng-stats-plugin`'s `ipng_stats_logtail` directive. Every datagram MUST begin with a literal version tag
(`v<N>\t`) so the collector can route each packet to the appropriate parser. Only `v1` is defined in this revision;
unknown versions MUST be counted as parse failures and dropped.
```nginx
log_format ipng_stats_logtail 'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```
The v1 payload MUST have exactly 12 tab-separated fields after the `v1` tag (13 fields total). `$server_addr` and
`$scheme` MUST be parsed but dropped; they are reserved for future use. Malformed datagrams (wrong version, wrong
field count, bad IP) MUST be counted (FR-8.5) and silently dropped.
- **FR-2.3** The file tailer MUST set `source_tag="direct"` on every record it parses. The UDP listener MUST propagate
`$ipng_source_tag` verbatim. This is the only difference in downstream processing between the two ingest paths.
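The fail-closed version dispatch of FR-2.2 can be sketched as follows (simplified: the bare field-count check stands in for the full 12-field v1 parser, which also validates the client IP and numeric fields):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errBadVersion = errors.New("unknown or missing version tag")

// parseUDPv1 is a stand-in for the collector's v1 parser: here it only
// enforces the 12-field payload layout (13 fields including the tag).
func parseUDPv1(payload string) ([]string, error) {
	fields := strings.Split(payload, "\t")
	if len(fields) != 12 {
		return nil, fmt.Errorf("v1: want 12 fields, got %d", len(fields))
	}
	return fields, nil
}

// dispatchUDP routes a datagram on its leading "v<N>\t" tag. Unknown or
// missing versions fail closed: the datagram is counted as a parse
// failure and dropped, so a future v2 parser can land here before any
// emitter is upgraded.
func dispatchUDP(line string) ([]string, error) {
	version, payload, ok := strings.Cut(line, "\t")
	if !ok {
		return nil, errBadVersion
	}
	switch version {
	case "v1":
		return parseUDPv1(payload)
	default:
		return nil, errBadVersion
	}
}

func main() {
	_, err := dispatchUDP("v2\tanything")
	fmt.Println(err) // unknown or missing version tag
}
```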
**FR-3 Ring buffers and time windows**
- **FR-3.1** Each collector and the aggregator MUST maintain two tiered ring buffers:
| Tier | Bucket size | Buckets | Top-K/bucket | Covers |
|--------|-------------|---------|--------------|--------|
| Fine | 1 min | 60 | 50 000 | 1 h |
| Coarse | 5 min | 288 | 5 000 | 24 h |
- **FR-3.2** The `Window` enum MUST map queries to tiers as follows:
| Window | Tier | Buckets summed |
|--------|--------|----------------|
| 1 m | fine | 1 |
| 5 m | fine | 5 |
| 15 m | fine | 15 |
| 60 m | fine | 60 |
| 6 h | coarse | 72 |
| 24 h | coarse | 288 |
- **FR-3.3** Every minute, the collector MUST snapshot its live map into the fine ring (top-50 000, sorted desc) and reset the live
map. Every fifth fine tick, the collector MUST merge the most recent five fine snapshots into one coarse snapshot (top-5 000).
The fine/coarse merge MUST be pinned to the 1-minute and 5-minute boundaries of the local clock so sparklines align across
collectors.
- **FR-3.4** Querying MUST always read from the rings, never from the live map. A sub-minute request MUST return an empty top-1
result rather than surfacing partially-accumulated data; this keeps per-minute results monotonic.
**FR-4 Push-based streaming and aggregation**
- **FR-4.1** The collector MUST expose a server-streaming RPC `StreamSnapshots(SnapshotRequest) → stream Snapshot` that emits one fine
(1-min) snapshot per minute rotation. Subscribers MUST receive the same snapshot independently (per-subscriber buffered fan-out,
bounded buffer, drop on full).
- **FR-4.2** The aggregator MUST subscribe to every configured collector via `StreamSnapshots` and merge snapshots into a single
ring-buffer cache. The merge strategy MUST be delta-based: on each new snapshot from collector `X`, the aggregator MUST subtract
`X`'s previous contribution and add the new entries, giving `O(snapshot_size)` per update (not `O(N_collectors × size)`).
- **FR-4.3** Each collector MUST expose a server-streaming RPC `DumpSnapshots(DumpSnapshotsRequest) → stream Snapshot` that
streams all fine buckets (with `is_coarse=false`) followed by all coarse buckets (with `is_coarse=true`). On startup, the aggregator
MUST call `DumpSnapshots` against every collector once (concurrently, after its own gRPC server is already listening), merge the
per-timestamp entries the same way the live path does, and load the result into its cache via a single atomic replacement.
Collectors that return `Unimplemented` MUST be skipped without blocking live streaming from the others.
- **FR-4.4** The aggregator MUST reconnect to each collector independently with exponential backoff (100 ms → cap 30 s). After three
consecutive connection failures the aggregator MUST zero the degraded collector's contribution (subtract its last-known snapshot
and delete its entry). When the collector recovers and sends a new snapshot, its contribution MUST automatically be reintegrated.
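The delta-based merge of FR-4.2 (and the degraded-collector zeroing of FR-4.4, which is the same subtraction with an empty new snapshot) can be sketched over plain maps. `mergeDelta` is a hypothetical name; the real aggregator works on encoded `Tuple6` labels:

```go
package main

import "fmt"

// mergeDelta applies one collector's new snapshot to the aggregator's
// merged bucket: subtract that collector's previous contribution, then
// add the new one. Cost is O(len(prev)+len(next)), independent of how
// many collectors are subscribed.
func mergeDelta(merged, prev, next map[string]int64) {
	for label, n := range prev {
		merged[label] -= n
		if merged[label] == 0 {
			delete(merged, label) // keep the map tight
		}
	}
	for label, n := range next {
		merged[label] += n
	}
}

func main() {
	// Merged view: collector A contributed {x:5}, collector B {x:3, y:2}.
	merged := map[string]int64{"x": 8, "y": 2}
	// A sends a new snapshot {x:1, z:4}; only A's delta is applied.
	mergeDelta(merged, map[string]int64{"x": 5}, map[string]int64{"x": 1, "z": 4})
	fmt.Println(merged["x"], merged["y"], merged["z"]) // 4 2 4
}
```

Passing an empty `next` map zeroes a degraded collector's contribution, which is exactly the FR-4.4 behaviour after three consecutive connection failures.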
**FR-5 Query service (`LogtailService`)**
- **FR-5.1** Collector and aggregator MUST implement the same gRPC `LogtailService`:
```protobuf
service LogtailService {
  rpc TopN(TopNRequest) returns (TopNResponse);
  rpc Trend(TrendRequest) returns (TrendResponse);
  rpc StreamSnapshots(SnapshotRequest) returns (stream Snapshot);
  rpc ListTargets(ListTargetsRequest) returns (ListTargetsResponse);
  rpc DumpSnapshots(DumpSnapshotsRequest) returns (stream Snapshot);
}
```
- **FR-5.2** `Filter` MUST support exact, inequality, and RE2-regex constraints on the dimensions of FR-1. Status and ASN accept
the six-operator expression language (`=`, `!=`, `>`, `>=`, `<`, `<=`). Website and URI accept regex match and regex exclusion.
TOR filtering uses a three-state enum (`ANY`/`YES`/`NO`). Source-tag filtering is exact match only.
- **FR-5.3** `GroupBy` MUST cover every dimension of FR-1 except `is_tor` (which is boolean and rarely useful as a group-by target):
`WEBSITE`, `CLIENT_PREFIX`, `REQUEST_URI`, `HTTP_RESPONSE`, `ASN_NUMBER`, `SOURCE_TAG`.
- **FR-5.4** `ListTargets` MUST return, from the aggregator, every configured collector with its display name and gRPC address; from
a collector, a single entry describing itself with an empty `addr` (meaning "this endpoint").
- **FR-5.5** All queries MUST be answered from the local ring buffers. The aggregator MUST NOT fan out to collectors at query time.
**FR-6 HTTP frontend**
- **FR-6.1** The frontend MUST render a server-rendered HTML dashboard with no JavaScript, using inline SVG for sparklines and
`<meta http-equiv="refresh">` for auto-refresh. It MUST work in text-mode browsers (w3m, lynx) and under `curl`.
- **FR-6.2** All filter, group-by, and window state MUST live in the URL query string so that URLs are shareable and bookmarkable.
No server-side session.
- **FR-6.3** The frontend MUST provide a drilldown affordance: clicking a row MUST add that row's value as a filter and advance the
group-by dimension through the cycle
`website → prefix → uri → status → asn → source_tag → website`.
- **FR-6.4** The frontend MUST issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline. `Trend` failure MUST
suppress the sparkline but not the table. `ListTargets` failure MUST hide the source picker but not the rest of the page.
- **FR-6.5** Appending `&raw=1` to any URL MUST return the `TopN` result as JSON, so the dashboard can be scripted without the CLI.
- **FR-6.6** The frontend MUST accept a `q=` parameter holding a mini filter expression (`status>=400 AND website~=gouda.*`). On
submission it MUST parse the expression and redirect to the canonical URL with the individual `f_*` params populated; parse errors
MUST render inline without losing the current filter state.
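The drilldown cycle of FR-6.3 reduces to a small lookup (hypothetical `nextGroupBy` helper; the frontend's actual code works on the `GroupBy` proto enum):

```go
package main

import "fmt"

// cycle is the FR-6.3 drilldown order:
// website → prefix → uri → status → asn → source_tag → website.
var cycle = []string{"website", "prefix", "uri", "status", "asn", "source_tag"}

// nextGroupBy returns the dimension a row click advances to.
func nextGroupBy(cur string) string {
	for i, g := range cycle {
		if g == cur {
			return cycle[(i+1)%len(cycle)]
		}
	}
	return cycle[0] // unknown dimension: restart the cycle
}

func main() {
	fmt.Println(nextGroupBy("asn"))        // source_tag
	fmt.Println(nextGroupBy("source_tag")) // website
}
```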
**FR-7 CLI**
- **FR-7.1** The CLI MUST provide four subcommands: `topn`, `trend`, `stream`, `targets`. Each subcommand MUST accept
`--target host:port[,host:port...]` and fan out concurrently, printing results in order with per-target headers (omitted for
single-target invocations, so output pipes cleanly into `jq`).
- **FR-7.2** The CLI MUST expose every `Filter` dimension as a dedicated flag and default to a human-readable table. `--json` MUST
switch to newline-delimited JSON for `stream` and to a single JSON array for `topn`/`trend`.
- **FR-7.3** `stream` MUST reconnect automatically on error with a 5 s backoff and run until interrupted.
**FR-8 Prometheus exposition (collector only)**
- **FR-8.1** The collector MUST expose a Prometheus `/metrics` endpoint on `-prom-listen` (default `:9100`). Setting the flag to the
empty string MUST disable it entirely.
- **FR-8.2** The collector MUST expose a per-request counter `nginx_http_requests_total{host, method, status}` capped at
`promCounterCap = 250 000` distinct label sets. When the cap is reached, further new label sets MUST be dropped (existing series
keep incrementing) until the map is rolled over.
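The FR-8.2 cap behaviour can be sketched as follows (hypothetical `cappedInc` over a plain map keyed by the rendered label set; the real exporter keys on `{host, method, status}`):

```go
package main

import "fmt"

// cappedInc increments a label-set counter but refuses to create a new
// label set once the cap is reached: existing series keep counting, new
// ones are dropped until the map is rolled over.
func cappedInc(counters map[string]uint64, labels string, limit int) bool {
	if _, ok := counters[labels]; !ok && len(counters) >= limit {
		return false // new series dropped, cap reached
	}
	counters[labels]++
	return true
}

func main() {
	c := map[string]uint64{}
	fmt.Println(cappedInc(c, `host="a",status="200"`, 2)) // true
	fmt.Println(cappedInc(c, `host="a",status="404"`, 2)) // true
	fmt.Println(cappedInc(c, `host="b",status="200"`, 2)) // false: cap hit
	fmt.Println(cappedInc(c, `host="a",status="200"`, 2)) // true: existing series
}
```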
- **FR-8.3** The collector MUST expose per-host histograms
`nginx_http_response_body_bytes{host, le}` (body-size distribution) and
`nginx_http_request_duration_seconds{host, le}` (request-time distribution). The duration histogram MUST NOT be split by
`source_tag` — its bucket count would multiply without operational benefit.
- **FR-8.4** The collector MUST expose two parallel roll-ups labeled by `source_tag` only (not cross-producted with host):
`nginx_http_requests_by_source_total{source_tag}` and
`nginx_http_response_body_bytes_by_source{source_tag, le}`. These are separate metric names to avoid inconsistent label sets
under a single name.
- **FR-8.5** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
`logtail_udp_packets_received_total` (datagrams off the socket),
`logtail_udp_loglines_success_total` (parsed OK), and
`logtail_udp_loglines_consumed_total` (forwarded to the store — i.e. not dropped).
### Non-Functional Requirements
**NFR-1 Correctness under concurrency**
- **NFR-1.1** The collector MUST run a single goroutine ("the store goroutine") that owns the live map and the ring-buffer write
path. Other goroutines MUST NOT write to these structures. The file tailer and the UDP listener MUST communicate with the store
goroutine through a bounded channel.
- **NFR-1.2** Readers (query RPCs and subscriber fan-out) MUST take an `RLock` on the rings. Writers MUST take a `Lock` only for the
moment the slice header of the new snapshot is installed; serialisation and network I/O MUST happen outside the lock.
- **NFR-1.3** `DumpSnapshots` MUST copy ring headers and filled counts under `RLock` only, then release the lock before streaming.
The minute-rotation write path MUST never observe a lock held for longer than a microsecond-scale slice copy.
- **NFR-1.4** A query that races with a rotation MUST observe a monotonically non-decreasing total for a fixed filter over a fixed
window; it MUST NOT observe a partially-rotated state that would cause a total to decrease compared to a prior reading.
**NFR-2 Memory bounds**
- **NFR-2.1** The collector's live map MUST be hard-capped at 100 000 entries. Once the cap is reached, only updates to existing keys
MUST proceed; new keys MUST be dropped until the next minute rotation resets the map. This bounds memory under high-cardinality
attacks.
- **NFR-2.2** Fine-ring snapshots MUST be capped at top-50 000 entries; coarse-ring snapshots at top-5 000. The full memory budget
for a collector is therefore approximately 845 MB (live map ~19 MB + fine ring ~558 MB + coarse ring ~268 MB).
- **NFR-2.3** The aggregator MUST apply the same tier caps as the collector. Its steady-state memory is roughly equivalent to one
collector regardless of the number of collectors subscribed.
- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The per-host and per-source
histograms MUST NOT be capped explicitly — they grow only with the distinct host count, which is bounded by the operator's vhost
configuration.
**NFR-3 Performance**
- **NFR-3.1** `ParseLine` and `ParseUDPLine` MUST use `strings.Split` / `strings.SplitN` (no regex), so that per-line cost stays
around 50 ns on commodity hardware.
- **NFR-3.2** `TopN` and `Trend` queries across the full 24-hour coarse ring MUST complete in well under 250 ms at the 50 000-entry
fine cap, for fully-specified filters.
- **NFR-3.3** The collector's input channel MUST be sized to absorb approximately 20 s of peak load (e.g. 200 000 at 10 K lines/s)
so that transient pauses in the store goroutine do not back up the tailer or the UDP listener.
- **NFR-3.4** When either the tailer or the UDP listener cannot enqueue a parsed record because the channel is full, the record
MUST be dropped rather than blocking the ingest goroutine. UDP drops MUST be visible via the counters in FR-8.5; file-path drops
are implicit (the tailer falls behind the file).
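The drop-rather-than-block rule of NFR-3.4 is a one-`select` idiom in Go (minimal sketch; the `LogRecord` here is a placeholder for the collector's real record type):

```go
package main

import "fmt"

// LogRecord stands in for the collector's parsed record.
type LogRecord struct{ Website string }

// tryEnqueue offers a record to the store goroutine's channel without
// ever blocking the ingest goroutine: if the channel is full the record
// is dropped and the caller bumps its drop counter (visible for the UDP
// path via the FR-8.5 counters).
func tryEnqueue(ch chan<- LogRecord, rec LogRecord) bool {
	select {
	case ch <- rec:
		return true
	default:
		return false // channel full: drop, count, move on
	}
}

func main() {
	ch := make(chan LogRecord, 1) // real channel is ~200 K deep (NFR-3.3)
	fmt.Println(tryEnqueue(ch, LogRecord{"a"})) // true
	fmt.Println(tryEnqueue(ch, LogRecord{"b"})) // false: buffer full
}
```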
**NFR-4 Fault tolerance and recovery**
- **NFR-4.1** The file tailer MUST tolerate logrotate automatically. On `RENAME`/`REMOVE` events it MUST drain the old file
descriptor to EOF, close it, and retry opening the original path with exponential backoff until the new file appears, with no SIGHUP
or restart required.
- **NFR-4.2** The aggregator MUST NOT block frontend queries during backfill. Its gRPC server MUST start listening first; backfill
(FR-4.3) MUST run in a background goroutine.
- **NFR-4.3** A collector restart MUST NOT affect peer collectors or the aggregator's ability to continue serving the surviving
collectors' data. When the restarted collector reconnects, its stream MUST resume without operator action.
- **NFR-4.4** An aggregator restart MUST recover its ring-buffer contents from all collectors via `DumpSnapshots`; live streaming
MUST resume in parallel with backfill so that no minute is lost even during a restart.
**NFR-5 Observability of the system itself**
- **NFR-5.1** The collector MUST expose operator-facing log lines on stdout covering: file discovery, logrotate reopen events, UDP
listener bind, subscriber connect/disconnect, and fatal configuration errors. The collector MUST NOT log anything on the per-request
hot path.
- **NFR-5.2** The aggregator MUST log each collector's connect, disconnect, degraded transition, and recovery. Backfill MUST log a
per-collector line with bucket counts, entry counts, and wall-clock duration.
- **NFR-5.3** The Prometheus exporter MUST be the primary out-of-band health signal. Counters FR-8.5 plus the per-host request
counter (FR-8.2) give an operator a full view of ingest health without needing to read the logs.
**NFR-6 Security**
- **NFR-6.1** gRPC traffic MUST be cleartext HTTP/2. Operators who expose the endpoints beyond a trusted network are expected to
terminate TLS in a front proxy.
- **NFR-6.2** The collector MUST bind its UDP listener to `127.0.0.1` by default (configurable via `-logtail-bind`) so that merely
setting `-logtail-port` does not expose the socket to the public Internet.
- **NFR-6.3** The system MUST NOT record per-request personally-identifying data beyond what nginx already logs. Client IPs are
truncated at ingest (FR-1.3); URIs lose their query strings (FR-1.4).
**NFR-7 Documentation and packaging**
- **NFR-7.1** The repository MUST ship `docs/user-guide.md` that walks an operator through nginx log format configuration, running
each of the four binaries (flags, systemd examples, Docker Compose), and integrating the Prometheus exporter. It MUST contain
enough examples that a new operator can stand up a single-host deployment end-to-end without reading the source.
- **NFR-7.2** The repository MUST ship `docs/design.md` (this document) covering the normative requirements and the architectural
rationale.
- **NFR-7.3** All four binaries MUST build as static Go binaries with `CGO_ENABLED=0 -trimpath -ldflags="-s -w"` and MUST ship
together in a single `scratch`-based Docker image. No OS, no shell, no runtime dependencies.
## Architecture Overview
### Process Model
The project ships four binaries:
- **`collector`** — runs on every nginx host. Ingests logs from files and/or UDP, maintains the live map and tiered rings, serves
`LogtailService` on port 9090, and exposes Prometheus on port 9100.
- **`aggregator`** — runs centrally. Subscribes to every collector, merges snapshots, serves the same `LogtailService` on port 9091.
- **`frontend`** — runs centrally, alongside the aggregator. HTTP server on port 8080, rendering HTML against the aggregator (or any
other `LogtailService` endpoint).
- **`cli`** — runs wherever the operator is. Talks to any `LogtailService`. No daemon.
Because all four binaries speak one service, the aggregator is optional for a single-host deployment: the frontend and CLI can point
directly at a collector.
### Data Flow
```
            ┌──────────────┐  files   ┌───────────────┐
 nginx ───▶ │  access.log  │─────────▶│  file tailer  │──┐
            │ (file mode)  │          │  (fsnotify)   │  │
            └──────────────┘          └───────────────┘  │   LogRecord       ┌───────────┐
                                                         ├──▶ channel ─────▶ │   store   │
            ┌──────────────┐  UDP     ┌───────────────┐  │    (200 K)        │ goroutine │
 nginx-ipng │ ipng_stats_  │─────────▶│ udp listener  │──┘                   └─────┬─────┘
 -stats-    │   logtail    │          │ (127.0.0.1)   │                            │
 plugin     └──────────────┘          └───────────────┘           Prom exporter ◀──┤
                                                                                   ▼
                                                                            ┌─────────────┐
                                                                            │  live map   │
                                                                            │  (≤100 K)   │
                                                                            └──────┬──────┘
                                                                                   │ every 1 m
                                                                                   ▼
                                                                            ┌─────────────┐
                                                                            │  fine ring  │────┐
                                                                            │   60×50 K   │    │
                                                                            └──────┬──────┘    │
                                                                                   │ every 5 m │
                                                                                   ▼           │
                                                                            ┌─────────────┐    │
                                                                            │ coarse ring │    │
                                                                            │   288×5 K   │    │
                                                                            └─────────────┘    │
               ┌───────────────────────────────────────────────────────────────────────────────┘
               │ StreamSnapshots (push)
               ▼
           aggregator ──▶ merged cache ──▶ frontend / CLI
```
Requests enter nginx, which writes either to a log file (file mode) or via the `ipng_stats_logtail` directive to a UDP socket
(UDP mode), or both. The collector has two ingest goroutines that parse a line into a `LogRecord` and enqueue it on a shared 200 K
channel. A single store goroutine consumes the channel, updating the live map and maintaining the tiered rings. A once-per-minute
timer rotates the live map into the fine ring and (every fifth tick) into the coarse ring, and fans the fresh snapshot out to every
`StreamSnapshots` subscriber. The aggregator is one such subscriber.
Query RPCs (`TopN`, `Trend`) MUST read only from the rings and MUST NOT read from the live map. The aggregator's cache is itself a
ring built from the merged-view snapshots; it is updated on the same 1-minute cadence regardless of how many collectors are
connected.
## Components
### Program 1 — Collector (`cmd/collector`)
#### Responsibilities
- Tail on-disk log files via a single `fsnotify.Watcher`, handle logrotate, and re-scan glob patterns periodically to pick up new
files (FR-2.1, NFR-4.1).
- Listen on an optional UDP socket for `ipng_stats_logtail` datagrams (FR-2.2).
- Parse each log line into a `LogRecord` (FR-1).
- Maintain the live map, fine ring, coarse ring, and subscriber fan-out under a single-writer goroutine (FR-3, NFR-1).
- Serve `LogtailService` on `-listen` (FR-5).
- Expose Prometheus metrics on `-prom-listen` (FR-8).
#### Key data types
- `LogRecord` — ten fields (website, client_prefix, URI, status, is_tor, asn, method, body_bytes_sent, request_time, source_tag).
Produced by `ParseLine` or `ParseUDPLine` and consumed by the store goroutine.
- `Tuple6` (historical name; carries seven fields now) — the aggregation key. NUL-separated when encoded as a map key for snapshots.
The code name is intentionally stable so downstream tests and consumers are not churned.
- `Snapshot` — `(timestamp, []Entry)` where `Entry = (label, count)` and `label` is an encoded `Tuple6`.
#### Presents
- `LogtailService` on TCP (default `:9090`).
- A Prometheus `/metrics` handler on TCP (default `:9100`).
#### Consumes
- One or more on-disk log files matched by `--logs` and/or `--logs-file` globs.
- Optionally, a UDP socket on `--logtail-bind:--logtail-port` (default `127.0.0.1`, disabled when port is `0`).
### Program 2 — Aggregator (`cmd/aggregator`)
#### Responsibilities
- Dial every configured collector and subscribe via `StreamSnapshots` (FR-4.2).
- Merge incoming snapshots into a single cache using delta-based subtraction, so a collector's contribution is updated in place
rather than accumulated (FR-4.2).
- At startup, call `DumpSnapshots` on each collector once, merge the per-timestamp entries, and load the result into the cache
atomically (FR-4.3).
- Handle collector outages with exponential-backoff reconnect and degraded-collector zeroing (FR-4.4).
- Serve the same `LogtailService` as the collector (FR-5).
- Maintain a `TargetRegistry` that maps collector addresses to display names (updated from the `source` field of incoming
snapshots).
#### Presents
- `LogtailService` on TCP (default `:9091`).
#### Consumes
- The `StreamSnapshots` and `DumpSnapshots` RPCs on every configured collector (`--collectors`).
### Program 3 — Frontend (`cmd/frontend`)
#### Responsibilities
- Render the drilldown dashboard server-side with no JavaScript (FR-6.1).
- Parse URL query string into filter / group-by / window state (FR-6.2).
- Issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline (FR-6.4).
- Render inline SVG sparklines from `TrendResponse` (FR-6.1).
- Support the mini filter-expression language (FR-6.6) and the `raw=1` JSON output (FR-6.5).
- Expose a source-picker row populated from `ListTargets`.
#### Presents
- An HTTP dashboard on TCP (default `:8080`).
#### Consumes
- Any `LogtailService` endpoint (`--target`, default `localhost:9091` — the aggregator).
### Program 4 — CLI (`cmd/cli`)
#### Responsibilities
- Dispatch to `topn`, `trend`, `stream`, or `targets` (FR-7.1).
- Parse shared and per-subcommand flags, build a `Filter` proto from them, and fan out to every `--target` concurrently (FR-7.2).
- Print human-readable tables by default; switch to JSON with `--json` (FR-7.2).
- Reconnect automatically in `stream` mode (FR-7.3).
#### Presents
- Exit status `0` on success, non-zero on RPC error (except `stream`, which runs until interrupted).
#### Consumes
- Any `LogtailService` endpoint.
### Protobuf service (`proto/logtail.proto`)
One proto file defines every shared type: `Tuple6` is encoded as a NUL-separated label string inside `TopNEntry`, and the
`Snapshot` message carries both fine (1-min) and coarse (5-min) ring contents. `GroupBy` and `Window` are enums; `Filter` carries
optional exact-match fields, regex fields, and the `StatusOp` comparison enum used for both `http_response` and `asn_number`.
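The NUL-separated `Tuple6` label encoding works because the six dimensions come from log fields that cannot contain a NUL byte. A sketch of the round-trip (field order here is illustrative, not normative):

```go
package main

import (
	"fmt"
	"strings"
)

// encodeLabel joins the six group-by dimensions into the single label
// string carried inside TopNEntry, using NUL as the separator.
func encodeLabel(dims [6]string) string {
	return strings.Join(dims[:], "\x00")
}

// decodeLabel is the inverse; it rejects labels with the wrong arity.
func decodeLabel(s string) ([6]string, error) {
	var out [6]string
	parts := strings.SplitN(s, "\x00", 6)
	if len(parts) != 6 {
		return out, fmt.Errorf("want 6 fields, got %d", len(parts))
	}
	copy(out[:], parts)
	return out, nil
}

func main() {
	dims := [6]string{"example.org", "/api", "GET", "200", "13335", "edge"}
	dec, _ := decodeLabel(encodeLabel(dims))
	fmt.Println(dec == dims) // true
}
```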
## Operational Concerns
### Deployment Topology
A typical deployment is:
- **Per nginx host:** one `collector` systemd unit, pointed at `/var/log/nginx/*.log` and/or listening on `127.0.0.1:9514` for the
`nginx-ipng-stats-plugin` UDP stream. Exposes `:9090` (gRPC) and `:9100` (Prometheus).
- **Central:** one `aggregator` systemd unit on e.g. `agg:9091`, subscribed to all collectors; and one `frontend` systemd unit on
`agg:8080`, pointed at the aggregator. Operators reach the dashboard via `http://agg:8080/`. Alternatively, the Docker Compose
file in the repo root runs the aggregator and frontend together.
- **Operator laptop:** `nginx-logtail` CLI invocations, pointed at the aggregator for fleet-wide questions or at a specific collector
for a single-host drilldown.
### Configuration
All four binaries are configured via flags with matching environment variables. The canonical reference is `docs/user-guide.md`.
Representative settings:
- `collector`: `--logs /var/log/nginx/*.log`, `--logtail-port 9514`, `--source $(hostname)`, `--prom-listen :9100`.
- `aggregator`: `--collectors nginx1:9090,nginx2:9090`, `--listen :9091`.
- `frontend`: `--target agg:9091`, `--listen :8080`.
- `cli`: no persistent configuration; every invocation carries `--target`.
### Reload and Restart Semantics
- **Collector restart.** The live map and both rings start empty. The file tailer resumes at EOF of each watched file (no historical
replay). The fine ring refills within an hour; the coarse ring within 24 hours.
- **Aggregator restart.** Backfill reconstructs the cache from all collectors' `DumpSnapshots` streams. The gRPC server is listening
before backfill begins (NFR-4.2), so the frontend is never blocked during restart — it just sees an incomplete cache for the few
seconds backfill takes.
- **Collector outage.** The aggregator reconnects with backoff; after three consecutive failures the collector's contribution is
zeroed (FR-4.4) so the merged view does not show stale counts. On recovery the zeroing is reversed by the next snapshot.
- **nginx logrotate.** The collector drains the old fd, closes, and retries the original path. No operator action (NFR-4.1).
- **nginx-ipng-stats-plugin reload.** The plugin's UDP socket is per-worker; a reload simply causes new workers to open fresh
sockets to the same address. The collector sees a brief gap and resumes.
### Observability of the System Itself
Primary channel is the collector's Prometheus endpoint (FR-8). Beyond the per-host request counter and the per-source roll-ups,
three UDP counters give direct visibility into the UDP ingest path:
- `logtail_udp_packets_received_total` — what arrived.
- `logtail_udp_loglines_success_total` — what parsed cleanly.
- `logtail_udp_loglines_consumed_total` — what made it to the store (i.e. was not dropped by a full channel).
`received - success` is the parse-failure rate; `success - consumed` is the back-pressure drop rate. Operators should alert on either
rate being non-zero.
Each binary logs human-readable lines on stdout for connect/disconnect events, logrotate reopen, backfill timing, and degraded
transitions. No per-request logging.
### Failure Modes
- **High-cardinality DDoS.** The live map hits 100 000 entries and stops accepting new keys until the next rotation (NFR-2.1).
Existing top-K entries keep accumulating, so the attacker's dominant prefixes / URIs remain visible. The cap resets every minute.
- **Collector crash.** In-flight live-map state for the current minute is lost. The next collector start resumes tailing; the
aggregator zeroes the degraded collector's contribution after a few seconds and reintegrates it when snapshots resume.
- **Aggregator crash.** No collector is affected. The operator restarts the aggregator; backfill reconstructs the cache.
- **Frontend crash.** Stateless. Operator restarts.
- **UDP datagram loss.** Any datagram dropped in-kernel (socket buffer full, network drop) does not register as a parse failure; it
is simply invisible. Operators should size `SO_RCVBUF` appropriately; the collector already requests 4 MiB.
- **Malformed log lines.** File format: lines with <8 tab-separated fields are silently skipped; an invalid IP also drops the line.
UDP: packets without a recognised `v<N>\t` prefix, or with the wrong field count for the claimed version, or with a bad IP, are
counted as received-but-not-success and dropped.
- **Clock skew between collectors.** Trend sparklines derived from merged data assume collectors are roughly NTP-synced. Per-bucket
alignment is to the local minute / 5-minute boundary of each collector.
- **gRPC traffic over untrusted links.** The system does not ship TLS; operators should front the gRPC ports with a TLS-terminating
proxy or an IPsec tunnel.
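The live-map cap behavior under a high-cardinality flood can be sketched as follows (toy limit; the real cap is 100 000 and resets at the minute rotation):

```go
package main

import "fmt"

// capInsert admits a new key only while the map is under the limit;
// keys already present always keep accumulating. That is why an
// attacker's dominant prefixes and URIs stay visible during a flood
// even after the cap is hit (NFR-2.1).
func capInsert(m map[string]uint64, key string, limit int) bool {
	if _, ok := m[key]; !ok && len(m) >= limit {
		return false // new key rejected until the next rotation
	}
	m[key]++
	return true
}

func main() {
	m := map[string]uint64{"seen": 1}
	fmt.Println(capInsert(m, "seen", 1)) // true: existing key still counts
	fmt.Println(capInsert(m, "new", 1))  // false: map at cap, new key dropped
	fmt.Println(m["seen"])               // 2
}
```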
### Security
- **No TLS, no auth.** Deliberate (NFR-6.1). Deploy on a trusted network or behind a TLS proxy.
- **UDP bind.** Default `127.0.0.1` so merely turning on the listener does not expose a public socket (NFR-6.2).
- **Client-IP truncation.** Client addresses are truncated at ingest; the system never stores full client IPs (NFR-6.3, FR-1.3).
- **Query-string stripping.** URIs lose their query strings at ingest. A user who cares about `?q=` parameters must re-engineer
nginx's log format — and then accept that cardinality consequence.
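Both ingest-time reductions are mechanical. A sketch, assuming /24 truncation for IPv4 and /48 for IPv6 (the actual prefix widths used by the collector may differ):

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// anonymize strips the query string from the URI and truncates the
// client address to a prefix, so full client IPs and query parameters
// never reach the store (NFR-6.3, FR-1.3).
func anonymize(uri, addr string) (string, string, error) {
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		uri = uri[:i]
	}
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return "", "", err
	}
	bits := 24 // assumed IPv4 width
	if ip.Is6() {
		bits = 48 // assumed IPv6 width
	}
	p, err := ip.Prefix(bits) // masks the low bits
	if err != nil {
		return "", "", err
	}
	return uri, p.Addr().String(), nil
}

func main() {
	uri, ip, _ := anonymize("/search?q=secret", "198.51.100.77")
	fmt.Println(uri, ip) // /search 198.51.100.0
}
```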
## Alternatives Considered
- **Log shipping to ClickHouse / ELK.** Rejected as the default: adds a storage tier to a problem that fits in a per-host 1 GB
ring, for the target fleet size. A future ClickHouse export from the aggregator is viable and would be additive (deferred).
- **Raw request logging to Kafka.** Rejected: preserves every request at much higher cost for no visibility benefit; the operator
wants top-K ranking, not a replay log. If raw logging is desired, nginx's own access log is the right tool.
- **Promtail / Grafana Loki.** Rejected as the primary interface. Loki is excellent for free-text log search but weak for fast
ranked aggregations over dozens of dimensions; the drilldown interaction the operator wants fits poorly into LogQL.
- **In-process Lua aggregator on each nginx.** Considered for the collector tier. Rejected: shipping counters to a central view
still requires a process outside nginx; keeping the ingest path out of the nginx worker avoids a class of latency regressions.
- **Pull-based collector polling (aggregator polls collectors every second).** Rejected in favor of push. Polling multiplies query
latency and makes the aggregator's cache stale by the poll interval. Push-stream with delta merge keeps the cache within seconds
of real time.
- **One metric name for both per-host and per-source_tag roll-ups.** Rejected for Prometheus hygiene. Mixing different label sets
under one metric name breaks aggregation rules; separate metric names (`_by_source`) are clearer and easier to query.
- **Cross-product of `host × source_tag` for every counter and histogram.** Rejected. With ~20 tags and ~50 hosts the cardinality
explodes quickly on the duration histogram without operational benefit. The duration histogram stays per-host; requests and body
size get a parallel `_by_source` rollup.
- **Writing every `snapshot` to disk for restart recovery.** Rejected in favor of `DumpSnapshots` RPC backfill. Disk-backed
persistence would multiply operational surface (rotation, fsck, permissions) for a feature that needs to survive only an
aggregator restart.
## Decisions Deferred Post-v0.2
- **ClickHouse export from aggregator.** 1-minute pre-aggregated rows pushed into a `SummingMergeTree` table for 7-day / 30-day
windows. Frontend would route longer windows to ClickHouse while shorter windows stay on the in-memory rings. Strictly additive;
no interface changes. Deferred until a concrete retention requirement lands.
- **TLS on gRPC endpoints.** The argument for shipping TLS changes if/when the aggregator is deployed across an untrusted network
segment. Until then, a front proxy is the right shape.
- **Ring-buffer sizing on a per-collector basis.** Today every collector ships the same 60×50 K / 288×5 K dimensions. A
low-traffic collector can afford smaller rings; a hot one might want larger. Deferred — the uniform default is operationally
simpler.
- **Authenticated Prometheus scraping.** The endpoint is currently open on `:9100`. If a future deployment puts the scraper on a
less-trusted path, scrape-side auth (bearer token, TLS client cert) is the right add-on.
- **Coarse tier beyond 24 h.** Extending to 7 days in-memory would cost ~70 MB per collector but add 2016 buckets to iterate on a
`W24H+` query. Deferred until the operator wants a 7-day drilldown without ClickHouse.