<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-logtail Design Document

## Metadata

| | |
| --- | --- |
| **Status** | Describes intended behavior as of `v0.2.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-17 |
| **Audience** | Operators and contributors running real-time traffic analysis and DDoS detection across a fleet of nginx hosts |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase — "can", "will", "does" — and should not be read as normative.

## Summary

`nginx-logtail` is a four-binary Go system for real-time analysis of nginx traffic across a fleet of hosts. Each nginx host runs a
**collector** that ingests logs (from files via `fsnotify`, from a UDP socket, or both) and maintains in-memory ranked top-K counters
across multiple time windows. A central **aggregator** subscribes to the collectors' snapshot streams and serves a merged view. An
**HTTP frontend** renders a drilldown dashboard (server-rendered HTML, zero JavaScript). A **CLI** offers the same queries as a
shell companion. All four programs speak a single gRPC service (`LogtailService`), so the frontend and CLI work against any collector
or the aggregator interchangeably.
## Background

Operators running tens of nginx hosts behind a load balancer need a live, drilldown view of request traffic for DDoS detection and
traffic analysis. Questions the system answers include:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Existing log-analysis pipelines (ELK, Loki, ClickHouse, etc.) answer questions like these but require infrastructure that is
disproportionate for the target workload. A handful of nginx hosts each doing ~10 K req/s at peak can be kept on a per-minute top-K
structure in ~1 GB of RAM per host, with <250 ms query latency across the whole fleet, without a storage tier.

A companion project, [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin), adds per-device attribution in nginx
itself and can emit a logtail-format access log as UDP datagrams. `nginx-logtail` was extended in `v0.2.0` to ingest that stream
natively, so operators can run it either from on-disk log files, from the UDP feed, or both on the same host.

## Goals and Non-Goals

### Product Goals
1. **Live top-K per (website, client_prefix, URI, status, is_tor, asn, source_tag).** For every combination of these dimensions the
system maintains an integer count, ranked so that the top entries are readily available across 1 m, 5 m, 15 m, 60 m, 6 h, and 24 h
windows.
2. **Sub-second query latency.** `TopN` and `Trend` queries MUST return from the collector and from the aggregator in well under one
second at the target scale (10 hosts, 10 K req/s each).
3. **Bounded memory.** The collector MUST stay within a 1 GB steady-state memory budget regardless of input cardinality, including
during high-cardinality DDoS attacks.
4. **Two ingest paths, one data model.** On-disk log files (`fsnotify`-tailed, logrotate-aware) and UDP datagrams (from
`nginx-ipng-stats-plugin`) MUST both feed the same in-memory structure, with a single log format per path and no operator-visible
difference downstream.
5. **No external storage, no TLS, no CGO.** The entire system runs as four static Go binaries on a trusted internal network. Operators
who need retention beyond the ring buffers SHOULD scrape Prometheus.
6. **One service contract.** Collectors and the aggregator implement the same gRPC `LogtailService`. Frontend and CLI MUST work
against either interchangeably, with the collector returning "itself" from `ListTargets` and the aggregator returning its configured
collector set.

### Non-Goals

- The system does **not** parse arbitrary nginx `log_format` strings. Two fixed tab-separated formats are supported: a file format and
a UDP format (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or Promtail.
- The system does **not** store raw log lines. Counts are aggregated at ingest; the original log lines are not kept in memory or on
disk. The project does not replace an access log.
- The system does **not** persist counters across restarts. Ring buffers are in-memory only. On aggregator restart, historical state
is reconstructed by calling `DumpSnapshots` on each collector (FR-4.3). On collector restart the rings start empty and refill as new
traffic arrives.
- The system does **not** provide per-URI request timing distributions. Latency histograms exist only in the collector's Prometheus
exposition (per host), not in the top-K data model.
- The system does **not** ship TLS or authentication for its gRPC endpoints. Operators who expose it beyond a trusted network are
expected to terminate TLS in a front proxy.
- The system is **not** a general-purpose metric store. The Prometheus exporter on the collector exposes a deliberately narrow set:
per-host request counter, per-host body-size and request-time histograms, and per-`source_tag` rollup counters.

## Requirements

Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.

### Functional Requirements
**FR-1 Counter data model**

- **FR-1.1** The canonical unit of counting MUST be a 7-tuple
`(website, client_prefix, http_request_uri, http_response, is_tor, asn, ipng_source_tag)` mapped to a 64-bit integer request count.
The data model contains no other fields: no timing, no byte counts, no method (those live only in the Prometheus exposition,
FR-8).
- **FR-1.2** `website` MUST be the nginx `$host` value.
- **FR-1.3** `client_prefix` MUST be the client IP truncated to a configurable prefix length, formatted as CIDR. Default `/24` for
IPv4 and `/48` for IPv6 (flags `-v4prefix`, `-v6prefix`). Truncation happens at ingest; the original address is not retained.
- **FR-1.4** `http_request_uri` MUST be the `$request_uri` path only — the query string (from the first `?` onward) MUST be stripped
at ingest. This is the dominant cardinality-reduction measure; DDoS traffic with attacker-generated query strings cannot grow the
working set. (A sketch of both normalisation steps follows this list.)
- **FR-1.5** `http_response` MUST be the HTTP status code as recorded by nginx.
- **FR-1.6** `is_tor` MUST be a boolean, populated by the operator in the log format (typically via a lookup against a TOR exit-node
list). For the file format, lines without this field default to `false` for backward compatibility.
- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). For the file format, lines without
this field default to `0`.
- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. For records from
on-disk log files, the collector MUST assign the tag `"direct"` (mirroring `nginx-ipng-stats-plugin`'s default-source convention). For
records from the UDP stream, the tag is taken from the log line as emitted by the plugin.
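
The two cardinality-reduction steps of FR-1.3 and FR-1.4 can be illustrated with a minimal Go sketch; the helper names are hypothetical and not the collector's actual API:

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// truncatePrefix maps a client address to its /24 (IPv4) or /48 (IPv6)
// prefix in CIDR notation, per FR-1.3. In the real collector the prefix
// lengths would come from -v4prefix / -v6prefix.
func truncatePrefix(remoteAddr string, v4bits, v6bits int) (string, error) {
	addr, err := netip.ParseAddr(remoteAddr)
	if err != nil {
		return "", err
	}
	bits := v6bits
	if addr.Is4() {
		bits = v4bits
	}
	prefix, err := addr.Prefix(bits) // masks off the host part
	if err != nil {
		return "", err
	}
	return prefix.String(), nil
}

// stripQuery drops everything from the first '?' onward, per FR-1.4.
func stripQuery(requestURI string) string {
	if i := strings.IndexByte(requestURI, '?'); i >= 0 {
		return requestURI[:i]
	}
	return requestURI
}

func main() {
	prefix, _ := truncatePrefix("198.51.100.23", 24, 48)
	fmt.Println(prefix)                             // 198.51.100.0/24
	fmt.Println(stripQuery("/search?q=foo&page=2")) // /search
}
```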
**FR-2 Log formats**

- **FR-2.1 File format.** The collector MUST accept nginx access logs in the following tab-separated layout, with the last two fields
(`is_tor`, `asn`) optional for backward compatibility:

```nginx
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
```

| # | Field | Ingested into |
|---|-------------------|----------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated) |
| 2 | `$msec` | (discarded) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent` | Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |

- **FR-2.2 UDP format.** The collector MUST accept datagrams in the following tab-separated layout, as emitted by
`nginx-ipng-stats-plugin`'s `ipng_stats_logtail` directive:

```nginx
log_format ipng_stats_logtail '$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```

Exactly 12 tab-separated fields are required. `$server_addr` and `$scheme` MUST be parsed but dropped; they are reserved for
future use. Malformed datagrams MUST be counted (FR-8.5) and silently dropped.

- **FR-2.3** The file tailer MUST set `source_tag="direct"` on every record it parses. The UDP listener MUST propagate
`$ipng_source_tag` verbatim. This is the only difference in downstream processing between the two ingest paths.
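
A sketch of the tab-splitting that FR-2.1 describes (and that NFR-3.1 constrains to `strings.Split`, no regex); `logRecord` and `parseFileLine` are illustrative names rather than the real `ParseLine`, and the `$is_tor` encoding shown is an assumption:

```go
package ingest

import (
	"errors"
	"strconv"
	"strings"
)

// logRecord holds only the FR-1 fields; the real LogRecord carries more
// (method, body bytes, request time), see the Components section.
type logRecord struct {
	Website   string
	RemoteIP  string
	URI       string
	Status    int
	IsTor     bool
	ASN       int32
	SourceTag string
}

// parseFileLine splits one FR-2.1 line on tabs; fields 8 ($is_tor) and
// 9 ($asn) are optional for backward compatibility. No regex (NFR-3.1).
func parseFileLine(line string) (logRecord, error) {
	f := strings.Split(line, "\t")
	if len(f) < 8 {
		return logRecord{}, errors.New("short line")
	}
	status, err := strconv.Atoi(f[5])
	if err != nil {
		return logRecord{}, err
	}
	rec := logRecord{
		Website:   f[0],
		RemoteIP:  f[1],
		URI:       f[4], // query string is stripped later, per FR-1.4
		Status:    status,
		SourceTag: "direct", // FR-2.3: file records are always "direct"
	}
	if len(f) > 8 {
		// The $is_tor encoding is operator-defined; "1"/"true" is assumed here.
		rec.IsTor = f[8] == "1" || f[8] == "true"
	}
	if len(f) > 9 {
		if asn, err := strconv.Atoi(f[9]); err == nil {
			rec.ASN = int32(asn)
		}
	}
	return rec, nil
}
```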
**FR-3 Ring buffers and time windows**

- **FR-3.1** Each collector and the aggregator MUST maintain two tiered ring buffers:

| Tier | Bucket size | Buckets | Top-K/bucket | Covers |
|--------|-------------|---------|--------------|--------|
| Fine | 1 min | 60 | 50 000 | 1 h |
| Coarse | 5 min | 288 | 5 000 | 24 h |

- **FR-3.2** The `Window` enum MUST map queries to tiers as follows:

| Window | Tier | Buckets summed |
|--------|--------|----------------|
| 1 m | fine | 1 |
| 5 m | fine | 5 |
| 15 m | fine | 15 |
| 60 m | fine | 60 |
| 6 h | coarse | 72 |
| 24 h | coarse | 288 |

- **FR-3.3** Every minute, the collector MUST snapshot its live map into the fine ring (top-50 000, sorted desc) and reset the live
map. Every fifth fine tick, the collector MUST merge the most recent five fine snapshots into one coarse snapshot (top-5 000).
The fine/coarse merge MUST be pinned to the 1-minute and 5-minute boundaries of the local clock so sparklines align across
collectors.
- **FR-3.4** Querying MUST always read from the rings, never from the live map. A sub-minute request MUST return an empty top-1
result rather than surfacing partially-accumulated data; this keeps per-minute results monotonic.
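
The once-per-minute rotation of FR-3.3 is essentially a sort-and-truncate of the live map. A minimal sketch, assuming a `map[string]uint64` live map keyed by the encoded tuple (names are illustrative):

```go
package store

import "sort"

// Entry mirrors a Snapshot entry: an encoded tuple label and its count.
type Entry struct {
	Label string
	Count uint64
}

// rotate turns the live map into a ranked snapshot of at most topK
// entries (50 000 for the fine ring, 5 000 for the coarse ring, FR-3.1)
// and hands back a fresh live map for the next minute.
func rotate(live map[string]uint64, topK int) ([]Entry, map[string]uint64) {
	entries := make([]Entry, 0, len(live))
	for label, count := range live {
		entries = append(entries, Entry{Label: label, Count: count})
	}
	// Sort descending by count, per FR-3.3.
	sort.Slice(entries, func(i, j int) bool { return entries[i].Count > entries[j].Count })
	if len(entries) > topK {
		entries = entries[:topK]
	}
	return entries, make(map[string]uint64, 1024)
}
```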
**FR-4 Push-based streaming and aggregation**

- **FR-4.1** The collector MUST expose a server-streaming RPC `StreamSnapshots(SnapshotRequest) → stream Snapshot` that emits one fine
(1-min) snapshot per minute rotation. Subscribers MUST receive the same snapshot independently (per-subscriber buffered fan-out,
bounded buffer, drop on full).
- **FR-4.2** The aggregator MUST subscribe to every configured collector via `StreamSnapshots` and merge snapshots into a single
ring-buffer cache. The merge strategy MUST be delta-based: on each new snapshot from collector `X`, the aggregator MUST subtract
`X`'s previous contribution and add the new entries, giving `O(snapshot_size)` per update (not `O(N_collectors × size)`). A sketch
of this merge follows the list.
- **FR-4.3** Each collector MUST expose a server-streaming RPC `DumpSnapshots(DumpSnapshotsRequest) → stream Snapshot` that
streams all fine buckets (with `is_coarse=false`) followed by all coarse buckets (with `is_coarse=true`). On startup, the aggregator
MUST call `DumpSnapshots` against every collector once (concurrently, after its own gRPC server is already listening), merge the
per-timestamp entries the same way the live path does, and load the result into its cache via a single atomic replacement.
Collectors that return `Unimplemented` MUST be skipped without blocking live streaming from the others.
- **FR-4.4** The aggregator MUST reconnect to each collector independently with exponential backoff (100 ms → cap 30 s). After three
consecutive connection failures the aggregator MUST zero the degraded collector's contribution (subtract its last-known snapshot
and delete its entry). When the collector recovers and sends a new snapshot, its contribution MUST automatically be reintegrated.
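
An illustrative sketch of the delta-based merge of FR-4.2 and the degraded-collector zeroing of FR-4.4, keyed per timestamp bucket; all names are hypothetical and error handling is omitted:

```go
package aggregator

// Entry is one (encoded tuple label, count) pair from a Snapshot.
type Entry struct {
	Label string
	Count uint64
}

// mergedBucket holds the fleet-wide counts for one timestamp, plus each
// collector's last contribution so it can be subtracted again.
type mergedBucket struct {
	counts   map[string]uint64  // label -> fleet-wide count
	lastSeen map[string][]Entry // collector addr -> previous entries
}

func newMergedBucket() *mergedBucket {
	return &mergedBucket{
		counts:   map[string]uint64{},
		lastSeen: map[string][]Entry{},
	}
}

// apply replaces collector addr's contribution: subtract its previous
// entries, then add the new ones. Cost is O(len(prev) + len(next)),
// independent of how many collectors are connected (FR-4.2).
func (b *mergedBucket) apply(addr string, next []Entry) {
	for _, e := range b.lastSeen[addr] {
		if c := b.counts[e.Label]; c <= e.Count {
			delete(b.counts, e.Label)
		} else {
			b.counts[e.Label] = c - e.Count
		}
	}
	for _, e := range next {
		b.counts[e.Label] += e.Count
	}
	b.lastSeen[addr] = next
}

// zero removes a degraded collector entirely (FR-4.4).
func (b *mergedBucket) zero(addr string) {
	b.apply(addr, nil)
	delete(b.lastSeen, addr)
}
```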
**FR-5 Query service (`LogtailService`)**

- **FR-5.1** Collector and aggregator MUST implement the same gRPC `LogtailService`:

```protobuf
service LogtailService {
  rpc TopN(TopNRequest) returns (TopNResponse);
  rpc Trend(TrendRequest) returns (TrendResponse);
  rpc StreamSnapshots(SnapshotRequest) returns (stream Snapshot);
  rpc ListTargets(ListTargetsRequest) returns (ListTargetsResponse);
  rpc DumpSnapshots(DumpSnapshotsRequest) returns (stream Snapshot);
}
```

- **FR-5.2** `Filter` MUST support exact, inequality, and RE2-regex constraints on the dimensions of FR-1. Status and ASN accept
the six-operator expression language (`=`, `!=`, `>`, `>=`, `<`, `<=`). Website and URI accept regex match and regex exclusion.
TOR filtering uses a three-state enum (`ANY`/`YES`/`NO`). Source-tag filtering is exact match only.
- **FR-5.3** `GroupBy` MUST cover every dimension of FR-1 except `is_tor` (which is boolean and rarely useful as a group-by target):
`WEBSITE`, `CLIENT_PREFIX`, `REQUEST_URI`, `HTTP_RESPONSE`, `ASN_NUMBER`, `SOURCE_TAG`.
- **FR-5.4** `ListTargets` MUST return, from the aggregator, every configured collector with its display name and gRPC address; from
a collector, a single entry describing itself with an empty `addr` (meaning "this endpoint").
- **FR-5.5** All queries MUST be answered from the local ring buffers. The aggregator MUST NOT fan out to collectors at query time.
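
The six-operator expression language of FR-5.2 reduces, for status and ASN, to an integer comparison. A minimal sketch (hypothetical names, not the actual `Filter` evaluation code):

```go
package query

// StatusOp mirrors the six comparison operators of FR-5.2.
type StatusOp int

const (
	OpEq StatusOp = iota // =
	OpNe                 // !=
	OpGt                 // >
	OpGe                 // >=
	OpLt                 // <
	OpLe                 // <=
)

// matchInt evaluates one status / ASN constraint; for example the
// expression "status>=400" becomes matchInt(rec.Status, OpGe, 400).
func matchInt(value int, op StatusOp, operand int) bool {
	switch op {
	case OpEq:
		return value == operand
	case OpNe:
		return value != operand
	case OpGt:
		return value > operand
	case OpGe:
		return value >= operand
	case OpLt:
		return value < operand
	case OpLe:
		return value <= operand
	default:
		return false
	}
}
```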
**FR-6 HTTP frontend**

- **FR-6.1** The frontend MUST render a server-rendered HTML dashboard with no JavaScript, using inline SVG for sparklines and
`<meta http-equiv="refresh">` for auto-refresh. It MUST work in text-mode browsers (w3m, lynx) and under `curl`.
- **FR-6.2** All filter, group-by, and window state MUST live in the URL query string so that URLs are shareable and bookmarkable.
No server-side session.
- **FR-6.3** The frontend MUST provide a drilldown affordance: clicking a row MUST add that row's value as a filter and advance the
group-by dimension through the cycle
`website → prefix → uri → status → asn → source_tag → website` (see the sketch after this list).
- **FR-6.4** The frontend MUST issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline. `Trend` failure MUST
suppress the sparkline but not the table. `ListTargets` failure MUST hide the source picker but not the rest of the page.
- **FR-6.5** Appending `&raw=1` to any URL MUST return the `TopN` result as JSON, so the dashboard can be scripted without the CLI.
- **FR-6.6** The frontend MUST accept a `q=` parameter holding a mini filter expression (`status>=400 AND website~=gouda.*`). On
submission it MUST parse the expression and redirect to the canonical URL with the individual `f_*` params populated; parse errors
MUST render inline without losing the current filter state.
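
The drilldown cycle of FR-6.3 is a fixed successor mapping over the group-by dimensions of FR-5.3. An illustrative sketch (the helper name is hypothetical):

```go
package frontend

// GroupBy mirrors the FR-5.3 dimensions.
type GroupBy int

const (
	Website GroupBy = iota
	ClientPrefix
	RequestURI
	HTTPResponse
	ASNNumber
	SourceTag
)

// nextGroupBy advances the drilldown cycle of FR-6.3:
// website → prefix → uri → status → asn → source_tag → website.
func nextGroupBy(g GroupBy) GroupBy {
	switch g {
	case Website:
		return ClientPrefix
	case ClientPrefix:
		return RequestURI
	case RequestURI:
		return HTTPResponse
	case HTTPResponse:
		return ASNNumber
	case ASNNumber:
		return SourceTag
	default:
		return Website
	}
}
```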
**FR-7 CLI**

- **FR-7.1** The CLI MUST provide four subcommands: `topn`, `trend`, `stream`, `targets`. Each subcommand MUST accept
`--target host:port[,host:port...]` and fan out concurrently, printing results in order with per-target headers (omitted for
single-target invocations, so output pipes cleanly into `jq`).
- **FR-7.2** The CLI MUST expose every `Filter` dimension as a dedicated flag and default to a human-readable table. `--json` MUST
switch to newline-delimited JSON for `stream` and to a single JSON array for `topn`/`trend`.
- **FR-7.3** `stream` MUST reconnect automatically on error with a 5 s backoff and run until interrupted.
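
The concurrent fan-out of FR-7.1 (query every `--target`, print results in target order) can be sketched with one goroutine per target; `queryOne` stands in for the real RPC call:

```go
package cli

import "sync"

// fanOut runs queryOne against every target concurrently and returns
// the results in the same order as the targets, so per-target headers
// print deterministically (FR-7.1).
func fanOut(targets []string, queryOne func(target string) (string, error)) []string {
	results := make([]string, len(targets))
	var wg sync.WaitGroup
	for i, t := range targets {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			out, err := queryOne(t)
			if err != nil {
				out = "error: " + err.Error()
			}
			results[i] = out
		}(i, t)
	}
	wg.Wait()
	return results
}
```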
**FR-8 Prometheus exposition (collector only)**

- **FR-8.1** The collector MUST expose a Prometheus `/metrics` endpoint on `-prom-listen` (default `:9100`). Setting the flag to the
empty string MUST disable it entirely.
- **FR-8.2** The collector MUST expose a per-request counter `nginx_http_requests_total{host, method, status}` capped at
`promCounterCap = 250 000` distinct label sets. When the cap is reached, further new label sets MUST be dropped (existing series
keep incrementing) until the map is rolled over.
- **FR-8.3** The collector MUST expose per-host histograms
`nginx_http_response_body_bytes{host, le}` (body-size distribution) and
`nginx_http_request_duration_seconds{host, le}` (request-time distribution). The duration histogram MUST NOT be split by
`source_tag` — its bucket count would multiply without operational benefit.
- **FR-8.4** The collector MUST expose two parallel roll-ups labeled by `source_tag` only (not cross-producted with host):
`nginx_http_requests_by_source_total{source_tag}` and
`nginx_http_response_body_bytes_by_source{source_tag, le}`. These are separate metric names to avoid inconsistent label sets
under a single name.
- **FR-8.5** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
`logtail_udp_packets_received_total` (datagrams off the socket),
`logtail_udp_loglines_success_total` (parsed OK), and
`logtail_udp_loglines_consumed_total` (forwarded to the store — i.e. not dropped).
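
Using the standard `prometheus/client_golang` library, the FR-8.2 request counter and the FR-8.5 UDP counters could be declared roughly as follows (a sketch only; the real collector may wire these up differently):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// FR-8.2: per-request counter with host/method/status labels. The
	// 250 000 label-set cap of FR-8.2 is enforced by the caller, not shown here.
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "nginx_http_requests_total",
		Help: "HTTP requests seen by the collector.",
	}, []string{"host", "method", "status"})

	// FR-8.5: UDP ingest health counters.
	udpReceived = promauto.NewCounter(prometheus.CounterOpts{
		Name: "logtail_udp_packets_received_total",
		Help: "Datagrams read from the UDP socket.",
	})
	udpParsed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "logtail_udp_loglines_success_total",
		Help: "Datagrams that parsed into a LogRecord.",
	})
	udpConsumed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "logtail_udp_loglines_consumed_total",
		Help: "LogRecords enqueued to the store (not dropped).",
	})
)

// count records one request on the per-host counter.
func count(host, method, status string) {
	httpRequests.WithLabelValues(host, method, status).Inc()
}
```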
### Non-Functional Requirements

**NFR-1 Correctness under concurrency**

- **NFR-1.1** The collector MUST run a single goroutine ("the store goroutine") that owns the live map and the ring-buffer write
path. Other goroutines MUST NOT write to these structures. The file tailer and the UDP listener MUST communicate with the store
goroutine through a bounded channel.
- **NFR-1.2** Readers (query RPCs and subscriber fan-out) MUST take an `RLock` on the rings. Writers MUST take a `Lock` only for the
moment the slice header of the new snapshot is installed; serialisation and network I/O MUST happen outside the lock.
- **NFR-1.3** `DumpSnapshots` MUST copy ring headers and filled counts under `RLock` only, then release the lock before streaming.
The minute-rotation write path MUST never observe a lock held for longer than a microsecond-scale slice copy.
- **NFR-1.4** A query that races with a rotation MUST observe a monotonically non-decreasing total for a fixed filter over a fixed
window; it MUST NOT observe a partially-rotated state that would cause a total to decrease compared to a prior reading.
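
The locking discipline of NFR-1.2 and NFR-1.3 (build the snapshot outside the lock, install it under a brief write lock, read under `RLock`) might look like this sketch with hypothetical names:

```go
package rings

import "sync"

type Entry struct {
	Label string
	Count uint64
}

// ring is one tier of the ring buffer, guarded by mu.
type ring struct {
	mu      sync.RWMutex
	buckets [][]Entry // one ranked snapshot per bucket
	next    int       // write position
}

// install places an already-built snapshot into the ring. The snapshot
// is constructed entirely outside the lock (NFR-1.2); the critical
// section is only the slice-header store.
func (r *ring) install(snapshot []Entry) {
	r.mu.Lock()
	r.buckets[r.next%len(r.buckets)] = snapshot
	r.next++
	r.mu.Unlock()
}

// copyBuckets takes the slice headers under RLock and releases the lock
// before any serialisation or network I/O (NFR-1.3). Bucket slices are
// never mutated after install, so sharing them is safe.
func (r *ring) copyBuckets() [][]Entry {
	r.mu.RLock()
	out := make([][]Entry, len(r.buckets))
	copy(out, r.buckets)
	r.mu.RUnlock()
	return out
}
```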
**NFR-2 Memory bounds**

- **NFR-2.1** The collector's live map MUST be hard-capped at 100 000 entries. Once the cap is reached, only updates to existing keys
MUST proceed; new keys MUST be dropped until the next minute rotation resets the map. This bounds memory under high-cardinality
attacks.
- **NFR-2.2** Fine-ring snapshots MUST be capped at top-50 000 entries; coarse-ring snapshots at top-5 000. The full memory budget
for a collector is therefore approximately 845 MB (live map ~19 MB + fine ring ~558 MB + coarse ring ~268 MB).
- **NFR-2.3** The aggregator MUST apply the same tier caps as the collector. Its steady-state memory is roughly equivalent to one
collector regardless of the number of collectors subscribed.
- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The per-host and per-source
histograms MUST NOT be capped explicitly — they grow only with the distinct host count, which is bounded by the operator's vhost
configuration.

**NFR-3 Performance**

- **NFR-3.1** `ParseLine` and `ParseUDPLine` MUST use `strings.Split` / `strings.SplitN` (no regex), so that per-line cost stays
around 50 ns on commodity hardware.
- **NFR-3.2** `TopN` and `Trend` queries across the full 24-hour coarse ring MUST complete in well under 250 ms at the 50 000-entry
fine cap, for fully-specified filters.
- **NFR-3.3** The collector's input channel MUST be sized to absorb approximately 20 s of peak load (e.g. 200 000 at 10 K lines/s)
so that transient pauses in the store goroutine do not back up the tailer or the UDP listener.
- **NFR-3.4** When either the tailer or the UDP listener cannot enqueue a parsed record because the channel is full, the record
MUST be dropped rather than blocking the ingest goroutine. UDP drops MUST be visible via the counters in FR-8.5; file-path drops
are implicit (the tailer falls behind the file).
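
The drop-rather-than-block behavior of NFR-3.4 is the standard non-blocking channel send in Go. A sketch; the `LogRecord` type is a stand-in:

```go
package ingest

// LogRecord is a stand-in for the parsed record type.
type LogRecord struct{}

// tryEnqueue attempts a non-blocking send into the store channel (sized
// per NFR-3.3). When the channel is full the record is dropped and false
// is returned; the UDP path then simply does not increment the
// "consumed" counter of FR-8.5, which makes the drop visible.
func tryEnqueue(ch chan<- LogRecord, rec LogRecord) bool {
	select {
	case ch <- rec:
		return true
	default:
		return false // channel full: drop rather than block (NFR-3.4)
	}
}
```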
**NFR-4 Fault tolerance and recovery**

- **NFR-4.1** The file tailer MUST tolerate logrotate automatically. On `RENAME`/`REMOVE` events it MUST drain the old file
descriptor to EOF, close it, and retry opening the original path with exponential backoff until the new file appears. The reopen
MUST NOT require a SIGHUP or a restart (sketched after this list).
- **NFR-4.2** The aggregator MUST NOT block frontend queries during backfill. Its gRPC server MUST start listening first; backfill
(FR-4.3) MUST run in a background goroutine.
- **NFR-4.3** A collector restart MUST NOT affect peer collectors or the aggregator's ability to continue serving the surviving
collectors' data. When the restarted collector reconnects, its stream MUST resume without operator action.
- **NFR-4.4** An aggregator restart MUST recover its ring-buffer contents from all collectors via `DumpSnapshots`; live streaming
MUST resume in parallel with backfill so that no minute is lost even during a restart.
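
The logrotate handling of NFR-4.1 hinges on reacting to `fsnotify` rename and remove events. A compressed sketch using `github.com/fsnotify/fsnotify`; the drain and reopen helpers are hypothetical and error handling is elided:

```go
package tailer

import "github.com/fsnotify/fsnotify"

// watchRotation reacts to logrotate on a single path: when the watched
// file is renamed or removed, the old descriptor is drained to EOF and
// the original path is reopened with backoff (NFR-4.1).
func watchRotation(path string, drainAndClose func(), reopenWithBackoff func(string)) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	if err := w.Add(path); err != nil {
		return err
	}
	for {
		select {
		case ev, ok := <-w.Events:
			if !ok {
				return nil
			}
			if ev.Op&(fsnotify.Rename|fsnotify.Remove) != 0 {
				drainAndClose()         // read the old fd to EOF, then close it
				reopenWithBackoff(path) // retry until logrotate creates the new file
				_ = w.Add(path)         // re-arm the watch on the new inode
			}
		case _, ok := <-w.Errors:
			if !ok {
				return nil
			}
		}
	}
}
```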
**NFR-5 Observability of the system itself**

- **NFR-5.1** The collector MUST expose operator-facing log lines on stdout covering: file discovery, logrotate reopen events, UDP
listener bind, subscriber connect/disconnect, and fatal configuration errors. The collector MUST NOT log anything on the per-request
hot path.
- **NFR-5.2** The aggregator MUST log each collector's connect, disconnect, degraded transition, and recovery. Backfill MUST log a
per-collector line with bucket counts, entry counts, and wall-clock duration.
- **NFR-5.3** The Prometheus exporter MUST be the primary out-of-band health signal. Counters FR-8.5 plus the per-host request
counter (FR-8.2) give an operator a full view of ingest health without needing to read the logs.

**NFR-6 Security**

- **NFR-6.1** gRPC traffic MUST be cleartext HTTP/2. Operators who expose the endpoints beyond a trusted network are expected to
terminate TLS in a front proxy.
- **NFR-6.2** The collector MUST bind its UDP listener to `127.0.0.1` by default (configurable via `-logtail-bind`) so that merely
setting `-logtail-port` does not expose the socket to the public Internet.
- **NFR-6.3** The system MUST NOT record per-request personally-identifying data beyond what nginx already logs. Client IPs are
truncated at ingest (FR-1.3); URIs lose their query strings (FR-1.4).

**NFR-7 Documentation and packaging**

- **NFR-7.1** The repository MUST ship `docs/user-guide.md` that walks an operator through nginx log format configuration, running
each of the four binaries (flags, systemd examples, Docker Compose), and integrating the Prometheus exporter. It MUST contain
enough examples that a new operator can stand up a single-host deployment end-to-end without reading the source.
- **NFR-7.2** The repository MUST ship `docs/design.md` (this document) covering the normative requirements and the architectural
rationale.
- **NFR-7.3** All four binaries MUST build as static Go binaries with `CGO_ENABLED=0 -trimpath -ldflags="-s -w"` and MUST ship
together in a single `scratch`-based Docker image. No OS, no shell, no runtime dependencies.
## Architecture Overview

### Process Model

The project ships four binaries:

- **`collector`** — runs on every nginx host. Ingests logs from files and/or UDP, maintains the live map and tiered rings, serves
`LogtailService` on port 9090, and exposes Prometheus on port 9100.
- **`aggregator`** — runs centrally. Subscribes to every collector, merges snapshots, serves the same `LogtailService` on port 9091.
- **`frontend`** — runs centrally, alongside the aggregator. HTTP server on port 8080, rendering HTML against the aggregator (or any
other `LogtailService` endpoint).
- **`cli`** — runs wherever the operator is. Talks to any `LogtailService`. No daemon.

Because all four binaries speak one service, the aggregator is optional for a single-host deployment: the frontend and CLI can point
directly at a collector.

### Data Flow
```
nginx (file mode)          nginx-ipng-stats-plugin (ipng_stats_logtail)
   │ access.log               │ UDP datagrams
   ▼                          ▼
file tailer (fsnotify)     udp listener (127.0.0.1)
   │                          │
   └────────────┬─────────────┘
                ├──▶ Prom exporter
                ▼
   LogRecord channel (200 K)
                │
                ▼
   store goroutine
                │
                ▼
   live map (≤100 K)
                │ every 1 m
                ▼
   fine ring (60 × 50 K) ────▶ StreamSnapshots (push) ────▶ aggregator ──▶ merged cache ──▶ frontend / CLI
                │ every 5 m
                ▼
   coarse ring (288 × 5 K)
```

Requests enter nginx. nginx writes either to a log file (file mode) or, via the `ipng_stats_logtail` directive, to a UDP socket
(UDP mode), or both. The collector has two ingest goroutines that parse a line into a `LogRecord` and enqueue it on a shared 200 K
channel. A single store goroutine consumes the channel, updating the live map and maintaining the tiered rings. A once-per-minute
timer rotates the live map into the fine ring and (every fifth tick) into the coarse ring, and fans the fresh snapshot out to every
`StreamSnapshots` subscriber. The aggregator is one such subscriber.
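
A condensed, illustrative sketch of the store goroutine described above, including the live-map cap of NFR-2.1; the names are not the collector's actual identifiers:

```go
package collector

import "time"

const liveMapCap = 100_000 // NFR-2.1

type LogRecord struct {
	TupleKey string // encoded 7-tuple of FR-1
}

// run is the single-writer store goroutine (NFR-1.1): it alone touches
// the live map, and rotates it into the fine ring once per minute.
func run(records <-chan LogRecord, rotateFine func(map[string]uint64)) {
	live := make(map[string]uint64, 1024)
	tick := time.NewTicker(time.Minute) // pinned to minute boundaries in the real collector (FR-3.3)
	defer tick.Stop()
	for {
		select {
		case rec, ok := <-records:
			if !ok {
				return
			}
			if _, exists := live[rec.TupleKey]; !exists && len(live) >= liveMapCap {
				continue // cap reached: drop new keys until the next rotation (NFR-2.1)
			}
			live[rec.TupleKey]++
		case <-tick.C:
			rotateFine(live)                     // snapshot top-50 000 into the fine ring
			live = make(map[string]uint64, 1024) // reset for the next minute
		}
	}
}
```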
Query RPCs (`TopN`, `Trend`) MUST read only from the rings and MUST NOT read from the live map. The aggregator's cache is itself a
ring built from the merged-view snapshots; it is updated on the same 1-minute cadence regardless of how many collectors are
connected.

## Components

### Program 1 — Collector (`cmd/collector`)

#### Responsibilities

- Tail on-disk log files via a single `fsnotify.Watcher`, handle logrotate, and re-scan glob patterns periodically to pick up new
files (FR-2.1, NFR-4.1).
- Listen on an optional UDP socket for `ipng_stats_logtail` datagrams (FR-2.2).
- Parse each log line into a `LogRecord` (FR-1).
- Maintain the live map, fine ring, coarse ring, and subscriber fan-out under a single-writer goroutine (FR-3, NFR-1).
- Serve `LogtailService` on `-listen` (FR-5).
- Expose Prometheus metrics on `-prom-listen` (FR-8).

#### Key data types

- `LogRecord` — ten fields (website, client_prefix, URI, status, is_tor, asn, method, body_bytes_sent, request_time, source_tag).
Produced by `ParseLine` or `ParseUDPLine` and consumed by the store goroutine.
- `Tuple6` (historical name; carries seven fields now) — the aggregation key. NUL-separated when encoded as a map key for snapshots.
The code name is intentionally stable so downstream tests and consumers are not churned.
- `Snapshot` — `(timestamp, []Entry)` where `Entry = (label, count)` and `label` is an encoded `Tuple6`.

#### Presents

- `LogtailService` on TCP (default `:9090`).
- A Prometheus `/metrics` handler on TCP (default `:9100`).

#### Consumes

- One or more on-disk log files matched by `--logs` and/or `--logs-file` globs.
- Optionally, a UDP socket on `--logtail-bind:--logtail-port` (default `127.0.0.1`, disabled when port is `0`).

### Program 2 — Aggregator (`cmd/aggregator`)
#### Responsibilities

- Dial every configured collector and subscribe via `StreamSnapshots` (FR-4.2).
- Merge incoming snapshots into a single cache using delta-based subtraction, so a collector's contribution is updated in place
rather than accumulated (FR-4.2).
- At startup, call `DumpSnapshots` on each collector once, merge the per-timestamp entries, and load the result into the cache
atomically (FR-4.3).
- Handle collector outages with exponential-backoff reconnect and degraded-collector zeroing (FR-4.4).
- Serve the same `LogtailService` as the collector (FR-5).
- Maintain a `TargetRegistry` that maps collector addresses to display names (updated from the `source` field of incoming
snapshots).

#### Presents

- `LogtailService` on TCP (default `:9091`).

#### Consumes

- The `StreamSnapshots` and `DumpSnapshots` RPCs on every configured collector (`--collectors`).

### Program 3 — Frontend (`cmd/frontend`)

#### Responsibilities

- Render the drilldown dashboard server-side with no JavaScript (FR-6.1).
- Parse URL query string into filter / group-by / window state (FR-6.2).
- Issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline (FR-6.4).
- Render inline SVG sparklines from `TrendResponse` (FR-6.1).
- Support the mini filter-expression language (FR-6.6) and the `raw=1` JSON output (FR-6.5).
- Expose a source-picker row populated from `ListTargets`.

#### Presents

- An HTTP dashboard on TCP (default `:8080`).

#### Consumes

- Any `LogtailService` endpoint (`--target`, default `localhost:9091` — the aggregator).

### Program 4 — CLI (`cmd/cli`)

#### Responsibilities

- Dispatch to `topn`, `trend`, `stream`, or `targets` (FR-7.1).
- Parse shared and per-subcommand flags, build a `Filter` proto from them, and fan out to every `--target` concurrently (FR-7.2).
- Print human-readable tables by default; switch to JSON with `--json` (FR-7.2).
- Reconnect automatically in `stream` mode (FR-7.3).

#### Presents

- Exit status `0` on success, non-zero on RPC error (except `stream`, which runs until interrupted).

#### Consumes

- Any `LogtailService` endpoint.

### Protobuf service (`proto/logtail.proto`)

One proto file defines every shared type: `Tuple6` is encoded as a NUL-separated label string inside `TopNEntry`, and the
`Snapshot` message carries both fine (1-min) and coarse (5-min) ring contents. `GroupBy` and `Window` are enums; `Filter` carries
optional exact-match fields, regex fields, and the `StatusOp` comparison enum used for both `http_response` and `asn_number`.
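
The NUL-separated label encoding mentioned above could look roughly like this; the type layout and helper names are assumptions for illustration:

```go
package model

import "strings"

// Tuple6 is the aggregation key (historical name; seven fields, FR-1).
// All fields are kept as strings here for simplicity.
type Tuple6 struct {
	Website      string
	ClientPrefix string
	RequestURI   string
	HTTPResponse string
	IsTor        string
	ASN          string
	SourceTag    string
}

// encodeLabel joins the tuple fields with NUL bytes. NUL is assumed
// never to appear in the logged fields, so the join is unambiguous and
// cheap to split.
func encodeLabel(t Tuple6) string {
	return strings.Join([]string{
		t.Website, t.ClientPrefix, t.RequestURI, t.HTTPResponse,
		t.IsTor, t.ASN, t.SourceTag,
	}, "\x00")
}

// decodeLabel is the inverse of encodeLabel; missing trailing fields
// decode as empty strings.
func decodeLabel(label string) Tuple6 {
	f := strings.SplitN(label, "\x00", 7)
	for len(f) < 7 {
		f = append(f, "")
	}
	return Tuple6{f[0], f[1], f[2], f[3], f[4], f[5], f[6]}
}
```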
## Operational Concerns

### Deployment Topology

A typical deployment is:

- **Per nginx host:** one `collector` systemd unit, pointed at `/var/log/nginx/*.log` and/or listening on `127.0.0.1:9514` for the
`nginx-ipng-stats-plugin` UDP stream. Exposes `:9090` (gRPC) and `:9100` (Prometheus).
- **Central:** one `aggregator` systemd unit on e.g. `agg:9091`, subscribed to all collectors; and one `frontend` systemd unit on
`agg:8080`, pointed at the aggregator. Operators reach the dashboard via `http://agg:8080/`. Alternatively, the Docker Compose
file in the repo root runs the aggregator and frontend together.
- **Operator laptop:** `logtail-cli` invocations, pointed at the aggregator for fleet-wide questions or at a specific collector for
a single-host drilldown.

### Configuration

All four binaries are configured via flags with matching environment variables. The canonical reference is `docs/user-guide.md`.
Representative settings:

- `collector`: `--logs /var/log/nginx/*.log`, `--logtail-port 9514`, `--source $(hostname)`, `--prom-listen :9100`.
- `aggregator`: `--collectors nginx1:9090,nginx2:9090`, `--listen :9091`.
- `frontend`: `--target agg:9091`, `--listen :8080`.
- `cli`: no persistent configuration; every invocation carries `--target`.

### Reload and Restart Semantics
- **Collector restart.** The live map and both rings start empty. The file tailer resumes at EOF of each watched file (no historical
replay). The fine ring refills within an hour; the coarse ring within 24 hours.
- **Aggregator restart.** Backfill reconstructs the cache from all collectors' `DumpSnapshots` streams. The gRPC server is listening
before backfill begins (NFR-4.2), so the frontend is never blocked during restart — it just sees an incomplete cache for the few
seconds backfill takes.
- **Collector outage.** The aggregator reconnects with backoff; after three consecutive failures the collector's contribution is
zeroed (FR-4.4) so the merged view does not show stale counts. On recovery the zeroing is reversed by the next snapshot.
- **nginx logrotate.** The collector drains the old fd, closes it, and retries the original path. No operator action is needed (NFR-4.1).
- **nginx-ipng-stats-plugin reload.** The plugin's UDP socket is per-worker; a reload simply causes new workers to open fresh
sockets to the same address. The collector sees a brief gap and resumes.

### Observability of the System Itself

The primary channel is the collector's Prometheus endpoint (FR-8). Beyond the per-host request counter and the per-source roll-ups,
three UDP counters give direct visibility into the UDP ingest path:

- `logtail_udp_packets_received_total` — what arrived.
- `logtail_udp_loglines_success_total` — what parsed cleanly.
- `logtail_udp_loglines_consumed_total` — what made it to the store (i.e. was not dropped by a full channel).

`received - success` is the parse-failure rate; `success - consumed` is the back-pressure drop rate. Operators should alert on either
being non-zero.

Each binary logs human-readable lines on stdout for connect/disconnect events, logrotate reopen, backfill timing, and degraded
transitions. No per-request logging.

### Failure Modes
- **High-cardinality DDoS.** The live map hits 100 000 entries and stops accepting new keys until the next rotation (NFR-2.1).
Existing top-K entries keep accumulating, so the attacker's dominant prefixes / URIs remain visible. The cap resets every minute.
- **Collector crash.** In-flight live-map state for the current minute is lost. The next collector start resumes tailing; the
aggregator zeroes the degraded collector's contribution after a few seconds and reintegrates it when snapshots resume.
- **Aggregator crash.** No collector is affected. The operator restarts the aggregator; backfill reconstructs the cache.
- **Frontend crash.** Stateless. Operator restarts.
- **UDP datagram loss.** Any datagram dropped in-kernel (socket buffer full, network drop) does not register as a parse failure; it
is simply invisible. Operators should size `SO_RCVBUF` appropriately; the collector already requests 4 MiB.
- **Malformed log lines.** File format: lines with <8 tab-separated fields are silently skipped; an invalid IP also drops the line.
UDP: packets without exactly 12 fields are counted as received-but-not-success and dropped.
- **Clock skew between collectors.** Trend sparklines derived from merged data assume collectors are roughly NTP-synced. Per-bucket
alignment is to the local minute / 5-minute boundary of each collector.
- **gRPC traffic over untrusted links.** The system does not ship TLS; operators should front the gRPC ports with a TLS-terminating
proxy or an IPsec tunnel.

### Security

- **No TLS, no auth.** Deliberate (NFR-6.1). Deploy on a trusted network or behind a TLS proxy.
- **UDP bind.** Default `127.0.0.1` so merely turning on the listener does not expose a public socket (NFR-6.2).
- **Client-IP truncation.** Client addresses are truncated at ingest; the system never stores full client IPs (NFR-6.3, FR-1.3).
- **Query-string stripping.** URIs lose their query strings at ingest. A user who cares about `?q=` parameters must re-engineer
nginx's log format — and then accept that cardinality consequence.

## Alternatives Considered
- **Log shipping to ClickHouse / ELK.** Rejected as the default: adds a storage tier to a problem that fits in a per-host 1 GB
ring, for the target fleet size. A future ClickHouse export from the aggregator is viable and would be additive (deferred).
- **Raw request logging to Kafka.** Rejected: preserves every request at much higher cost for no visibility benefit; the operator
wants top-K ranking, not a replay log. If raw logging is desired, nginx's own access log is the right tool.
- **Promtail / Grafana Loki.** Rejected as the primary interface. Loki is excellent for free-text log search but weak for fast
ranked aggregations over dozens of dimensions; the drilldown interaction the operator wants fits poorly into LogQL.
- **In-process Lua aggregator on each nginx.** Considered for the collector tier. Rejected: shipping counters to a central view
still requires a process outside nginx; keeping the ingest path out of the nginx worker avoids a class of latency regressions.
- **Pull-based collector polling (aggregator polls collectors every second).** Rejected in favor of push. Polling multiplies query
latency and makes the aggregator's cache stale by the poll interval. Push-stream with delta merge keeps the cache within seconds
of real time.
- **One metric name for both per-host and per-source_tag roll-ups.** Rejected for Prometheus hygiene. Mixing different label sets
under one metric name breaks aggregation rules; separate metric names (`_by_source`) are clearer and easier to query.
- **Cross-product of `host × source_tag` for every counter and histogram.** Rejected. With ~20 tags and ~50 hosts the cardinality
explodes quickly on the duration histogram without operational benefit. The duration histogram stays per-host; requests and body
size get a parallel `_by_source` rollup.
- **Writing every `snapshot` to disk for restart recovery.** Rejected in favor of `DumpSnapshots` RPC backfill. Disk-backed
persistence would multiply operational surface (rotation, fsck, permissions) for a feature that needs to survive only an
aggregator restart.

## Decisions Deferred Post-v0.2

- **ClickHouse export from aggregator.** 1-minute pre-aggregated rows pushed into a `SummingMergeTree` table for 7-day / 30-day
windows. Frontend would route longer windows to ClickHouse while shorter windows stay on the in-memory rings. Strictly additive;
no interface changes. Deferred until a concrete retention requirement lands.
- **TLS on gRPC endpoints.** The argument for shipping TLS changes if/when the aggregator is deployed across an untrusted network
segment. Until then, a front proxy is the right shape.
- **Ring-buffer sizing on a per-collector basis.** Today every collector ships the same 60×50 K / 288×5 K dimensions. A
low-traffic collector can afford smaller rings; a hot one might want larger. Deferred — the uniform default is operationally
simpler.
- **Authenticated Prometheus scraping.** The endpoint is currently open on `:9100`. If a future deployment puts the scraper on a
less-trusted path, scrape-side auth (bearer token, TLS client cert) is the right add-on.
- **Coarse tier beyond 24 h.** Extending to 7 days in-memory would cost ~70 MB per collector but add 2016 buckets to iterate on a
`W24H+` query. Deferred until the operator wants a 7-day drilldown without ClickHouse.