6647f95be4
Wire-format and metric overhaul. Both file and UDP ingest now share one
versioned ParseLine that dispatches on the v<N>\t prefix; v1 stays
unchanged, v2 adds $bytes_sent (replacing $body_bytes_sent),
$request_length, $upstream_response_time, and $upstream_status. File
ingest gains the same versioning, and the legacy positional file format
is removed (no live deployments).
Prometheus exposition is rewritten:
- nginx_http_bytes_sent and nginx_http_request_duration_seconds gain
a source_tag label.
- nginx_http_requests_by_source_total gains status_class.
- New v2-only metrics: nginx_http_request_bytes,
nginx_http_upstream_duration_seconds,
nginx_http_upstream_requests_total{status_class}.
- Dropped nginx_http_response_body_bytes_by_source (subsumed by the
dual-labeled bytes_sent metric).
Adds 'make fixstyle' (gofmt -w) and clears all golangci-lint findings
across the repo (errcheck, S1001, ST1005, unused).
Docs in design.md FR-2/FR-8 and user-guide.md are rewritten to present
v2 as the recommended log format.
<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-logtail Design Document

## Metadata

| | |
| --- | --- |
| **Status** | Describes intended behavior as of `v0.2.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-17 |
| **Audience** | Operators and contributors running real-time traffic analysis and DDoS detection across a fleet of nginx hosts |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase ("can", "will", "does") and should not be read as normative.

## Summary

`nginx-logtail` is a four-binary Go system for real-time analysis of nginx traffic across a fleet of hosts. Each nginx host runs a
**collector** that ingests logs (from files via `fsnotify`, from a UDP socket, or both) and maintains in-memory ranked top-K counters
across multiple time windows. A central **aggregator** subscribes to the collectors' snapshot streams and serves a merged view. An
**HTTP frontend** renders a drilldown dashboard (server-rendered HTML, zero JavaScript). A **CLI** offers the same queries as a
shell companion. All four programs speak a single gRPC service (`LogtailService`), so the frontend and CLI work against any collector
or the aggregator interchangeably.

## Background

Operators running tens of nginx hosts behind a load balancer need a live, drilldown view of request traffic for DDoS detection and
traffic analysis. Questions the system answers include:

- Which client prefix is causing the most HTTP 429s right now?
- Which website is getting the most 503s over the last 24 hours?
- Which nginx machine is the busiest?
- Is there a DDoS in progress, and from where?

Existing log-analysis pipelines (ELK, Loki, ClickHouse, etc.) answer questions like these but require infrastructure that is
disproportionate for the target workload. A handful of nginx hosts each doing ~10 K req/s at peak can be kept on a per-minute top-K
structure in ~1 GB of RAM per host, with <250 ms query latency across the whole fleet, without a storage tier.

A companion project, [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin), adds per-device attribution in nginx
itself and can emit a logtail-format access log as UDP datagrams. `nginx-logtail` was extended in `v0.2.0` to ingest that stream
natively, so operators can run it either from on-disk log files, from the UDP feed, or both on the same host.

## Goals and Non-Goals

### Product Goals

1. **Live top-K per (website, client_prefix, URI, status, is_tor, asn, source_tag).** For every combination of these dimensions the
   system maintains an integer count, ranked so that the top entries are readily available across 1 m, 5 m, 15 m, 60 m, 6 h, and 24 h
   windows.
2. **Sub-second query latency.** `TopN` and `Trend` queries MUST return from the collector and from the aggregator in well under one
   second at the target scale (10 hosts, 10 K req/s each).
3. **Bounded memory.** The collector MUST stay within a 1 GB steady-state memory budget regardless of input cardinality, including
   during high-cardinality DDoS attacks.
4. **Two ingest paths, one data model.** On-disk log files (`fsnotify`-tailed, logrotate-aware) and UDP datagrams (from
   `nginx-ipng-stats-plugin`) MUST both feed the same in-memory structure, with a single log format per path and no operator-visible
   difference downstream.
5. **No external storage, no TLS, no CGO.** The entire system runs as four static Go binaries on a trusted internal network. Operators
   who need retention beyond the ring buffers SHOULD scrape Prometheus.
6. **One service contract.** Collectors and the aggregator implement the same gRPC `LogtailService`. Frontend and CLI MUST work
   against either interchangeably, with the collector returning "itself" from `ListTargets` and the aggregator returning its configured
   collector set.

### Non-Goals

- The system does **not** parse arbitrary nginx `log_format` strings. A single versioned tab-separated format is
  supported on both file and UDP ingest (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or
  Promtail.
- The system does **not** store raw log lines. Counts are aggregated at ingest; the original log lines are not kept in memory or on
  disk. The project does not replace an access log.
- The system does **not** persist counters across restarts. Ring buffers are in-memory only. On aggregator restart, historical state
  is reconstructed by calling `DumpSnapshots` on each collector (FR-4.3). On collector restart the rings start empty and refill as new
  traffic arrives.
- The system does **not** provide per-URI request timing distributions. Latency histograms exist only in the collector's Prometheus
  exposition (per host), not in the top-K data model.
- The system does **not** ship TLS or authentication for its gRPC endpoints. Operators who expose it beyond a trusted network are
  expected to terminate TLS in a front proxy.
- The system is **not** a general-purpose metric store. The Prometheus exporter on the collector exposes a deliberately narrow set:
  per-host request counter, per-host body-size and request-time histograms, and per-`source_tag` rollup counters.

## Requirements

Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.

### Functional Requirements

**FR-1 Counter data model**

- **FR-1.1** The canonical unit of counting MUST be a 7-tuple
  `(website, client_prefix, http_request_uri, http_response, is_tor, asn, ipng_source_tag)` mapped to a 64-bit integer request count.
  The data model contains no other fields: no timing, no byte counts, no method (those live only in the Prometheus exposition,
  FR-8).
- **FR-1.2** `website` MUST be the nginx `$host` value.
- **FR-1.3** `client_prefix` MUST be the client IP truncated to a configurable prefix length, formatted as CIDR. Default `/24` for
  IPv4 and `/48` for IPv6 (flags `-v4prefix`, `-v6prefix`). Truncation happens at ingest; the original address is not retained.
- **FR-1.4** `http_request_uri` MUST be the `$request_uri` path only; the query string (from the first `?` onward) MUST be stripped
  at ingest. This is the dominant cardinality-reduction measure; DDoS traffic with attacker-generated query strings cannot grow the
  working set.
- **FR-1.5** `http_response` MUST be the HTTP status code as recorded by nginx.
- **FR-1.6** `is_tor` MUST be a boolean, populated by the operator in the log format (typically via a lookup against a TOR exit-node
  list). Operators without TOR data MUST emit literal `0`.
- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). Operators without GeoIP data MUST
  emit literal `0`.
- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. The tag is
  always taken verbatim from the log line; the collector does not synthesise a fallback. Operators not running
  `nginx-ipng-stats-plugin` MUST emit a literal value (typically `"direct"`).

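The two ingest-time normalisations in FR-1.3 and FR-1.4 can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function names `clientPrefix` and `stripQuery` are hypothetical, and the prefix lengths are passed as plain arguments rather than read from the `-v4prefix` / `-v6prefix` flags.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// clientPrefix (hypothetical name) truncates ip to v4Bits for IPv4 or
// v6Bits for IPv6 and returns the result in CIDR notation.
func clientPrefix(ip string, v4Bits, v6Bits int) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	bits := v6Bits
	if addr.Is4() {
		bits = v4Bits
	}
	p, err := addr.Prefix(bits) // masks off the host bits
	if err != nil {
		return "", err
	}
	return p.String(), nil
}

// stripQuery (hypothetical name) drops everything from the first '?' onward.
func stripQuery(uri string) string {
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		return uri[:i]
	}
	return uri
}

func main() {
	p4, _ := clientPrefix("192.0.2.77", 24, 48)
	p6, _ := clientPrefix("2001:db8:1234:5678::1", 24, 48)
	fmt.Println(p4, p6, stripQuery("/search?q=x&page=2"))
}
```

Because both functions discard information before the key is built, the original address and query string never reach the live map.
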
**FR-2 Log formats**

- **FR-2.1 Versioned dispatch.** Both the file tailer and the UDP listener MUST funnel every input line through a single
  parser that switches on a leading `v<N>\t` version tag. Lines without a recognised tag (including the legacy
  positional file format) MUST be rejected and counted as parse failures. Two versions are defined: `v1` (FR-2.2) and
  `v2` (FR-2.3). Both ingest paths accept both versions; downstream processing is identical regardless of which path the
  line came in over. `$server_addr` and `$scheme` are parsed but discarded; they are reserved for future use.

- **FR-2.2 v1 format.** The v1 payload MUST be exactly 12 tab-separated fields after the `v1` tag (13 fields total).

  ```nginx
  log_format ipng_stats_logtail
      'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
      '$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
  ```

  | #  | Field              | Ingested into                              |
  |----|--------------------|--------------------------------------------|
  | 0  | `v1`               | version tag                                |
  | 1  | `$host`            | `website`                                  |
  | 2  | `$remote_addr`     | `client_prefix` (truncated)                |
  | 3  | `$request_method`  | Prom `method` label                        |
  | 4  | `$request_uri`     | `http_request_uri` (query stripped)        |
  | 5  | `$status`          | `http_response`                            |
  | 6  | `$body_bytes_sent` | Prom `nginx_http_bytes_sent`               |
  | 7  | `$request_time`    | Prom `nginx_http_request_duration_seconds` |
  | 8  | `$is_tor`          | `is_tor`                                   |
  | 9  | `$asn`             | `asn`                                      |
  | 10 | `$ipng_source_tag` | `source_tag`                               |
  | 11 | `$server_addr`     | *(parsed and discarded)*                   |
  | 12 | `$scheme`          | *(parsed and discarded)*                   |

- **FR-2.3 v2 format.** The v2 payload MUST be exactly 15 tab-separated fields after the `v2` tag (16 fields total).
  v2 replaces `$body_bytes_sent` with `$bytes_sent` (full wire bytes including headers) and adds three operationally
  important fields: `$request_length` (request size including headers), `$upstream_response_time`, and `$upstream_status`.
  The remaining v1 fields keep their relative order.

  ```nginx
  log_format ipng_stats_logtail
      'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
      '$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
      '$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
  ```

  | #  | Field                     | Ingested into                                         |
  |----|---------------------------|-------------------------------------------------------|
  | 0  | `v2`                      | version tag                                           |
  | 1  | `$host`                   | `website`                                             |
  | 2  | `$remote_addr`            | `client_prefix` (truncated)                           |
  | 3  | `$request_method`         | Prom `method` label                                   |
  | 4  | `$request_uri`            | `http_request_uri` (query stripped)                   |
  | 5  | `$status`                 | `http_response`                                       |
  | 6  | `$bytes_sent`             | Prom `nginx_http_bytes_sent`                          |
  | 7  | `$request_length`         | Prom `nginx_http_request_bytes` (v2-only)             |
  | 8  | `$request_time`           | Prom `nginx_http_request_duration_seconds`            |
  | 9  | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` (v2-only) |
  | 10 | `$upstream_status`        | Prom `nginx_http_upstream_requests_total` (v2-only)   |
  | 11 | `$is_tor`                 | `is_tor`                                              |
  | 12 | `$asn`                    | `asn`                                                 |
  | 13 | `$ipng_source_tag`        | `source_tag`                                          |
  | 14 | `$server_addr`            | *(parsed and discarded)*                              |
  | 15 | `$scheme`                 | *(parsed and discarded)*                              |

  When nginx serves the response without an upstream (static files, redirects, errors), nginx emits literal `-` for
  `$upstream_response_time` and `$upstream_status`. The parser MUST treat that as "no upstream", skip the upstream
  histograms, and not increment the upstream counter. When nginx retries across multiple upstreams, both fields are
  comma-separated; the parser MUST keep the last entry, since that is the upstream that ultimately served the response.

- **FR-2.4 Semantic shift on v2 rollout.** v1 fills `nginx_http_bytes_sent` from `$body_bytes_sent`; v2 fills it from
  `$bytes_sent`. Operators should expect a small step up in the metric when emitters move from v1 to v2 (header overhead;
  typically a few hundred bytes per response).

- **FR-2.5 Malformed input.** Lines with an unknown version, the wrong field count for the claimed version, or an
  unparsable IP MUST be silently dropped. UDP drops MUST be counted via FR-8.6; file-path drops are implicit (the tailer
  falls behind the file).

- **FR-2.6 Unknown `$is_tor` / `$asn`.** Operators without TOR or GeoIP data MUST emit literal `0` for both fields. A
  literal `0` in `$is_tor` parses as `false`; a literal `0` in `$asn` parses as ASN `0`, filterable at query time with
  `--asn '!=0'`.

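The dispatch and upstream rules of FR-2.1 through FR-2.3 can be sketched roughly as below. This is an illustrative outline only, not the real `ParseLine`: the helper names `splitVersioned` and `lastUpstream` are hypothetical, and it stops at field splitting rather than building a full record.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errMalformed = errors.New("malformed logtail line")

// fieldCount maps a version tag to its total field count (tag included),
// as stated in FR-2.2 and FR-2.3.
var fieldCount = map[string]int{"v1": 13, "v2": 16}

// splitVersioned (hypothetical helper) returns the version tag and the
// payload fields after it, rejecting unknown tags and wrong field counts.
func splitVersioned(line string) (string, []string, error) {
	fields := strings.Split(line, "\t")
	want, ok := fieldCount[fields[0]]
	if !ok || len(fields) != want {
		return "", nil, errMalformed
	}
	return fields[0], fields[1:], nil
}

// lastUpstream (hypothetical helper) keeps the final comma-separated entry
// and reports ok=false when nginx logged "-" (no upstream involved).
func lastUpstream(v string) (string, bool) {
	if i := strings.LastIndexByte(v, ','); i >= 0 {
		v = v[i+1:]
	}
	v = strings.TrimSpace(v)
	if v == "" || v == "-" {
		return "", false
	}
	return v, true
}

func main() {
	line := "v2\texample.org\t192.0.2.1\tGET\t/\t200\t1024\t310\t0.012\t" +
		"0.120, 0.004\t502, 200\t0\t64496\tdirect\t192.0.2.10\thttps"
	ver, payload, err := splitVersioned(line)
	fmt.Println(ver, len(payload), err)
	status, ok := lastUpstream(payload[9]) // payload[9] is $upstream_status
	fmt.Println(status, ok)
}
```

A line in the legacy positional format has no `v<N>` tag in field 0, so it fails the map lookup and is rejected by the same path as any other malformed input.
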
**FR-3 Ring buffers and time windows**

- **FR-3.1** Each collector and the aggregator MUST maintain two tiered ring buffers:

  | Tier   | Bucket size | Buckets | Top-K/bucket | Covers |
  |--------|-------------|---------|--------------|--------|
  | Fine   | 1 min       | 60      | 50 000       | 1 h    |
  | Coarse | 5 min       | 288     | 5 000        | 24 h   |

- **FR-3.2** The `Window` enum MUST map queries to tiers as follows:

  | Window | Tier   | Buckets summed |
  |--------|--------|----------------|
  | 1 m    | fine   | 1              |
  | 5 m    | fine   | 5              |
  | 15 m   | fine   | 15             |
  | 60 m   | fine   | 60             |
  | 6 h    | coarse | 72             |
  | 24 h   | coarse | 288            |

- **FR-3.3** Every minute, the collector MUST snapshot its live map into the fine ring (top-50 000, sorted desc) and reset the live
  map. Every fifth fine tick, the collector MUST merge the most recent five fine snapshots into one coarse snapshot (top-5 000).
  The fine/coarse merge MUST be pinned to the 1-minute and 5-minute boundaries of the local clock so sparklines align across
  collectors.
- **FR-3.4** Querying MUST always read from the rings, never from the live map. A sub-minute request MUST return an empty top-1
  result rather than surfacing partially-accumulated data; this keeps per-minute results monotonic.

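The fine-to-coarse merge in FR-3.3 amounts to summing counts per label and keeping the largest K. A minimal sketch, assuming snapshots are plain slices of a hypothetical `Entry` type (the real snapshot representation may differ):

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical (label, count) pair for illustration.
type Entry struct {
	Label string
	Count int64
}

// mergeTopK sums counts per label across snapshots and keeps the k
// largest, sorted by descending count.
func mergeTopK(snaps [][]Entry, k int) []Entry {
	sum := make(map[string]int64)
	for _, s := range snaps {
		for _, e := range s {
			sum[e.Label] += e.Count
		}
	}
	out := make([]Entry, 0, len(sum))
	for label, count := range sum {
		out = append(out, Entry{label, count})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Label < out[j].Label // deterministic tie-break
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	fine := [][]Entry{
		{{"a", 3}, {"b", 1}},
		{{"b", 5}, {"c", 2}},
	}
	fmt.Println(mergeTopK(fine, 2))
}
```

In the collector this would be called with the five most recent fine snapshots and `k = 5000`; the tie-break on label is an addition of this sketch to make the output reproducible.
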
**FR-4 Push-based streaming and aggregation**

- **FR-4.1** The collector MUST expose a server-streaming RPC `StreamSnapshots(SnapshotRequest) → stream Snapshot` that emits one fine
  (1-min) snapshot per minute rotation. Subscribers MUST receive the same snapshot independently (per-subscriber buffered fan-out,
  bounded buffer, drop on full).
- **FR-4.2** The aggregator MUST subscribe to every configured collector via `StreamSnapshots` and merge snapshots into a single
  ring-buffer cache. The merge strategy MUST be delta-based: on each new snapshot from collector `X`, the aggregator MUST subtract
  `X`'s previous contribution and add the new entries, giving `O(snapshot_size)` per update (not `O(N_collectors × size)`).
- **FR-4.3** The collector MUST expose a server-streaming RPC `DumpSnapshots(DumpSnapshotsRequest) → stream Snapshot` that
  streams all fine buckets (with `is_coarse=false`) followed by all coarse buckets (with `is_coarse=true`). On startup, the aggregator
  MUST call `DumpSnapshots` against every collector once (concurrently, after its own gRPC server is already listening), merge the
  per-timestamp entries the same way the live path does, and load the result into its cache via a single atomic replacement.
  Collectors that return `Unimplemented` MUST be skipped without blocking live streaming from the others.
- **FR-4.4** The aggregator MUST reconnect to each collector independently with exponential backoff (100 ms → cap 30 s). After three
  consecutive connection failures the aggregator MUST zero the degraded collector's contribution (subtract its last-known snapshot
  and delete its entry). When the collector recovers and sends a new snapshot, its contribution MUST automatically be reintegrated.

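The delta-based merge of FR-4.2 and the zeroing step of FR-4.4 can be illustrated with a small sketch. The `Merger` type and its methods are hypothetical names, and snapshots are modelled as simple label-to-count maps rather than the real `Snapshot` message.

```go
package main

import "fmt"

// Merger (hypothetical) keeps the merged view plus each collector's
// last contribution so it can be subtracted on the next update.
type Merger struct {
	merged map[string]int64
	prev   map[string]map[string]int64
}

func NewMerger() *Merger {
	return &Merger{
		merged: make(map[string]int64),
		prev:   make(map[string]map[string]int64),
	}
}

// Apply replaces one collector's contribution. Work is proportional to
// the old plus the new snapshot size, independent of the collector count.
// The caller must not mutate snap afterwards.
func (m *Merger) Apply(collector string, snap map[string]int64) {
	for label, count := range m.prev[collector] {
		m.merged[label] -= count
		if m.merged[label] <= 0 {
			delete(m.merged, label)
		}
	}
	for label, count := range snap {
		m.merged[label] += count
	}
	m.prev[collector] = snap
}

// Drop zeroes a degraded collector's contribution and forgets it.
func (m *Merger) Drop(collector string) {
	m.Apply(collector, nil)
	delete(m.prev, collector)
}

func main() {
	m := NewMerger()
	m.Apply("nginx1", map[string]int64{"x": 5})
	m.Apply("nginx2", map[string]int64{"x": 2, "y": 1})
	m.Apply("nginx1", map[string]int64{"x": 1}) // nginx1's next minute
	fmt.Println(m.merged)
	m.Drop("nginx2")
	fmt.Println(m.merged)
}
```

A recovered collector needs no special handling in this model: its next `Apply` finds no previous contribution and simply adds the new entries.
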
**FR-5 Query service (`LogtailService`)**

- **FR-5.1** Collector and aggregator MUST implement the same gRPC `LogtailService`:

  ```protobuf
  service LogtailService {
    rpc TopN(TopNRequest)                   returns (TopNResponse);
    rpc Trend(TrendRequest)                 returns (TrendResponse);
    rpc StreamSnapshots(SnapshotRequest)    returns (stream Snapshot);
    rpc ListTargets(ListTargetsRequest)     returns (ListTargetsResponse);
    rpc DumpSnapshots(DumpSnapshotsRequest) returns (stream Snapshot);
  }
  ```

- **FR-5.2** `Filter` MUST support exact, inequality, and RE2-regex constraints on the dimensions of FR-1. Status and ASN accept
  the six-operator expression language (`=`, `!=`, `>`, `>=`, `<`, `<=`). Website and URI accept regex match and regex exclusion.
  TOR filtering uses a three-state enum (`ANY`/`YES`/`NO`). Source-tag filtering is exact match only.
- **FR-5.3** `GroupBy` MUST cover every dimension of FR-1 except `is_tor` (which is boolean and rarely useful as a group-by target):
  `WEBSITE`, `CLIENT_PREFIX`, `REQUEST_URI`, `HTTP_RESPONSE`, `ASN_NUMBER`, `SOURCE_TAG`.
- **FR-5.4** `ListTargets` MUST return, from the aggregator, every configured collector with its display name and gRPC address; from
  a collector, a single entry describing itself with an empty `addr` (meaning "this endpoint").
- **FR-5.5** All queries MUST be answered from the local ring buffers. The aggregator MUST NOT fan out to collectors at query time.

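As an illustration of the six-operator language in FR-5.2, the comparison for an integer dimension such as status or ASN might look like the sketch below. The function name `matchInt` and the string form of the operator are assumptions; the actual `Filter` message may encode the operator as an enum.

```go
package main

import "fmt"

// matchInt (hypothetical) evaluates "value <op> operand" for the six
// operators listed in FR-5.2.
func matchInt(op string, value, operand int) (bool, error) {
	switch op {
	case "=":
		return value == operand, nil
	case "!=":
		return value != operand, nil
	case ">":
		return value > operand, nil
	case ">=":
		return value >= operand, nil
	case "<":
		return value < operand, nil
	case "<=":
		return value <= operand, nil
	default:
		return false, fmt.Errorf("unknown operator %q", op)
	}
}

func main() {
	ok, _ := matchInt(">=", 503, 400) // e.g. a status>=400 filter
	fmt.Println(ok)
}
```
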
**FR-6 HTTP frontend**

- **FR-6.1** The frontend MUST render a server-rendered HTML dashboard with no JavaScript, using inline SVG for sparklines and
  `<meta http-equiv="refresh">` for auto-refresh. It MUST work in text-mode browsers (w3m, lynx) and under `curl`.
- **FR-6.2** All filter, group-by, and window state MUST live in the URL query string so that URLs are shareable and bookmarkable.
  No server-side session.
- **FR-6.3** The frontend MUST provide a drilldown affordance: clicking a row MUST add that row's value as a filter and advance the
  group-by dimension through the cycle
  `website → prefix → uri → status → asn → source_tag → website`.
- **FR-6.4** The frontend MUST issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline. `Trend` failure MUST
  suppress the sparkline but not the table. `ListTargets` failure MUST hide the source picker but not the rest of the page.
- **FR-6.5** Appending `&raw=1` to any URL MUST return the `TopN` result as JSON, so the dashboard can be scripted without the CLI.
- **FR-6.6** The frontend MUST accept a `q=` parameter holding a mini filter expression (`status>=400 AND website~=gouda.*`). On
  submission it MUST parse the expression and redirect to the canonical URL with the individual `f_*` params populated; parse errors
  MUST render inline without losing the current filter state.

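The drilldown cycle in FR-6.3 is a fixed rotation over the group-by dimensions. A minimal sketch, assuming dimensions are identified by the short names used in the cycle above (`nextGroupBy` is a hypothetical helper, and falling back to `website` for unknown input is a choice made here, not a documented behaviour):

```go
package main

import "fmt"

// drillCycle lists the group-by dimensions in drilldown order (FR-6.3).
var drillCycle = []string{"website", "prefix", "uri", "status", "asn", "source_tag"}

// nextGroupBy returns the dimension that follows cur in the cycle,
// wrapping from source_tag back to website.
func nextGroupBy(cur string) string {
	for i, d := range drillCycle {
		if d == cur {
			return drillCycle[(i+1)%len(drillCycle)]
		}
	}
	return drillCycle[0] // unknown input: start of the cycle (assumption)
}

func main() {
	fmt.Println(nextGroupBy("website"), nextGroupBy("source_tag"))
}
```
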
**FR-7 CLI**

- **FR-7.1** The CLI MUST provide four subcommands: `topn`, `trend`, `stream`, `targets`. Each subcommand MUST accept
  `--target host:port[,host:port...]` and fan out concurrently, printing results in order with per-target headers (omitted for
  single-target invocations, so output pipes cleanly into `jq`).
- **FR-7.2** The CLI MUST expose every `Filter` dimension as a dedicated flag and default to a human-readable table. `--json` MUST
  switch to newline-delimited JSON for `stream` and to a single JSON array for `topn`/`trend`.
- **FR-7.3** `stream` MUST reconnect automatically on error with a 5 s backoff and run until interrupted.

**FR-8 Prometheus exposition (collector only)**

- **FR-8.1** The collector MUST expose a Prometheus `/metrics` endpoint on `-prom-listen` (default `:9100`). Setting the flag to the
  empty string MUST disable it entirely.
- **FR-8.2** The collector MUST expose a per-request counter `nginx_http_requests_total{host, method, status}` capped at
  `promCounterCap = 250 000` distinct label sets. When the cap is reached, further new label sets MUST be dropped (existing series
  keep incrementing) until the map is rolled over.
- **FR-8.3** The collector MUST expose two histograms keyed by `{host, source_tag}`:
  `nginx_http_bytes_sent{host, source_tag, le}` (response wire-bytes distribution; FR-2.4) and
  `nginx_http_request_duration_seconds{host, source_tag, le}` (end-to-end request time distribution).
  Cardinality is bounded by `host × source_tag × bucket_count`, which is small enough that no explicit cap is required.
- **FR-8.4** The collector MUST expose three v2-only metrics that are populated only when v2 records arrive (and, for the
  upstream metrics, only when nginx involved an upstream):
  `nginx_http_request_bytes{host, source_tag, le}` from `$request_length`,
  `nginx_http_upstream_duration_seconds{host, source_tag, le}` from `$upstream_response_time`, and
  `nginx_http_upstream_requests_total{host, source_tag, status_class}` from `$upstream_status`. `status_class` is the
  HTTP class of the upstream's status code, folded to `2xx`/`3xx`/`4xx`/`5xx`/`other`.
- **FR-8.5** The collector MUST expose a source-tag rollup counter
  `nginx_http_requests_by_source_total{source_tag, status_class}`. `status_class` is the HTTP class of `$status`, folded
  the same way as in FR-8.4. This rollup is intentionally not cross-producted with `host`; its purpose is fleet-wide
  source-attribution health, not per-host detail.
- **FR-8.6** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
  `logtail_udp_packets_received_total` (datagrams off the socket, one increment per `recvfrom`),
  `logtail_udp_loglines_success_total` (log lines that parsed OK, incremented once per log line; a single batched datagram from
  the nginx plugin may contribute many), and
  `logtail_udp_loglines_consumed_total` (log lines forwarded to the store channel, i.e. not dropped by back-pressure).

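The `status_class` folding used by FR-8.4 and FR-8.5 is a simple range check. A sketch, with `statusClass` as a hypothetical function name; note that under the five-way folding stated above, 1xx codes and anything outside 200 to 599 land in `other`:

```go
package main

import "fmt"

// statusClass folds an HTTP status code into 2xx/3xx/4xx/5xx/other.
func statusClass(code int) string {
	switch {
	case code >= 200 && code < 300:
		return "2xx"
	case code >= 300 && code < 400:
		return "3xx"
	case code >= 400 && code < 500:
		return "4xx"
	case code >= 500 && code < 600:
		return "5xx"
	default:
		return "other"
	}
}

func main() {
	for _, c := range []int{200, 301, 429, 503, 101} {
		fmt.Println(c, statusClass(c))
	}
}
```

Folding keeps the label cardinality of the rollup counters at five values per `source_tag`, however many distinct status codes appear in traffic.
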
### Non-Functional Requirements

**NFR-1 Correctness under concurrency**

- **NFR-1.1** The collector MUST run a single goroutine ("the store goroutine") that owns the live map and the ring-buffer write
  path. Other goroutines MUST NOT write to these structures. The file tailer and the UDP listener MUST communicate with the store
  goroutine through a bounded channel.
- **NFR-1.2** Readers (query RPCs and subscriber fan-out) MUST take an `RLock` on the rings. Writers MUST take a `Lock` only for the
  moment the slice header of the new snapshot is installed; serialisation and network I/O MUST happen outside the lock.
- **NFR-1.3** `DumpSnapshots` MUST copy ring headers and filled counts under `RLock` only, then release the lock before streaming.
  The minute-rotation write path MUST NOT observe a lock held for longer than a microsecond-scale slice copy.
- **NFR-1.4** A query that races with a rotation MUST observe a monotonically non-decreasing total for a fixed filter over a fixed
  window; it MUST NOT observe a partially-rotated state that would cause a total to decrease compared to a prior reading.

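The locking discipline of NFR-1.2 and NFR-1.3 can be sketched with a ring whose writer installs only a slice header under `Lock`, and whose readers copy headers under `RLock` before doing any slow work. The `Ring` type and method names are hypothetical, and the sketch assumes a snapshot is never mutated once installed.

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is a hypothetical (label, count) pair for illustration.
type Entry struct {
	Label string
	Count int64
}

// Ring is a fixed-size ring of immutable snapshots.
type Ring struct {
	mu      sync.RWMutex
	buckets [][]Entry
	head    int // next slot to overwrite
	filled  int // number of valid slots
}

func NewRing(n int) *Ring { return &Ring{buckets: make([][]Entry, n)} }

// Install stores the new snapshot; only the slice header is written
// while the write lock is held.
func (r *Ring) Install(snap []Entry) {
	r.mu.Lock()
	r.buckets[r.head] = snap
	r.head = (r.head + 1) % len(r.buckets)
	if r.filled < len(r.buckets) {
		r.filled++
	}
	r.mu.Unlock()
}

// Headers copies the filled slice headers, oldest first, under RLock.
// Callers stream or aggregate the result after the lock is released.
func (r *Ring) Headers() [][]Entry {
	r.mu.RLock()
	defer r.mu.RUnlock()
	n := len(r.buckets)
	out := make([][]Entry, 0, r.filled)
	for i := 0; i < r.filled; i++ {
		out = append(out, r.buckets[(r.head-r.filled+i+n)%n])
	}
	return out
}

func main() {
	r := NewRing(3)
	for _, l := range []string{"a", "b", "c", "d"} {
		r.Install([]Entry{{l, 1}})
	}
	fmt.Println(r.Headers())
}
```
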
**NFR-2 Memory bounds**

- **NFR-2.1** The collector's live map MUST be hard-capped at 100 000 entries. Once the cap is reached, updates to existing keys
  MUST still be applied, but new keys MUST be dropped until the next minute rotation resets the map. This bounds memory under
  high-cardinality attacks.
- **NFR-2.2** Fine-ring snapshots MUST be capped at top-50 000 entries; coarse-ring snapshots at top-5 000. The full memory budget
  for a collector is therefore approximately 845 MB (live map ~19 MB + fine ring ~558 MB + coarse ring ~268 MB).
- **NFR-2.3** The aggregator MUST apply the same tier caps as the collector. Its steady-state memory is roughly equivalent to one
  collector regardless of the number of collectors subscribed.
- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The dual-labeled
  `{host, source_tag}` histograms MUST NOT be capped explicitly; they grow only with the cross-product of distinct
  hosts and distinct source tags, both bounded by the operator's nginx configuration.

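The hard cap in NFR-2.1 (and the analogous `promCounterCap` in NFR-2.4) follows one pattern: existing keys keep counting, new keys are refused once the map is full. A minimal sketch with hypothetical names; the `dropped` tally is an addition for illustration and is not a metric the document defines.

```go
package main

import "fmt"

// cappedCounter (hypothetical) is a counter map with a hard key limit.
type cappedCounter struct {
	limit   int
	counts  map[string]int64
	dropped int64 // illustrative only
}

func newCappedCounter(limit int) *cappedCounter {
	return &cappedCounter{limit: limit, counts: make(map[string]int64)}
}

// Inc increments key. It returns false when key is new and the map is
// already at its limit, in which case nothing is stored.
func (c *cappedCounter) Inc(key string) bool {
	if _, ok := c.counts[key]; !ok && len(c.counts) >= c.limit {
		c.dropped++
		return false
	}
	c.counts[key]++
	return true
}

// Reset empties the map, as the minute rotation does for the live map.
func (c *cappedCounter) Reset() {
	c.counts = make(map[string]int64)
}

func main() {
	c := newCappedCounter(2)
	for _, k := range []string{"a", "b", "c", "a"} {
		fmt.Println(k, c.Inc(k))
	}
}
```

With this shape, memory is bounded by `limit` entries no matter how many distinct keys an attacker generates within one rotation interval.
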
**NFR-3 Performance**

- **NFR-3.1** `ParseLine` MUST use `strings.Split` / `strings.IndexByte` (no regex), so that per-line cost stays
  around 50 ns on commodity hardware.
- **NFR-3.2** `TopN` and `Trend` queries across the full 24-hour coarse ring MUST complete in well under 250 ms at the 50 000-entry
  fine cap, for fully-specified filters.
- **NFR-3.3** The collector's input channel MUST be sized to absorb approximately 20 s of peak load (e.g. 200 000 at 10 K lines/s)
  so that transient pauses in the store goroutine do not back up the tailer or the UDP listener.
- **NFR-3.4** When either the tailer or the UDP listener cannot enqueue a parsed record because the channel is full, the record
  MUST be dropped rather than blocking the ingest goroutine. UDP drops MUST be visible via the counters in FR-8.6; file-path drops
  are implicit (the tailer falls behind the file).

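The drop-instead-of-block behaviour in NFR-3.4 maps onto Go's non-blocking channel send. A minimal sketch, with `tryEnqueue` as a hypothetical helper and a cut-down `LogRecord` holding a single field:

```go
package main

import "fmt"

// LogRecord is reduced to one field for this illustration.
type LogRecord struct {
	Website string
}

// tryEnqueue attempts a non-blocking send. It returns false when the
// channel is full, so the caller can count the drop and move on.
func tryEnqueue(ch chan<- LogRecord, r LogRecord) bool {
	select {
	case ch <- r:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan LogRecord, 1) // the document sizes the real channel at 200 000
	fmt.Println(tryEnqueue(ch, LogRecord{Website: "a.example"}))
	fmt.Println(tryEnqueue(ch, LogRecord{Website: "b.example"}))
}
```

On the UDP path, the gap between `logtail_udp_loglines_success_total` and `logtail_udp_loglines_consumed_total` would then correspond to the number of `false` returns.
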
**NFR-4 Fault tolerance and recovery**

- **NFR-4.1** The file tailer MUST tolerate logrotate automatically. On `RENAME`/`REMOVE` events it MUST drain the old file
  descriptor to EOF, close it, and retry opening the original path with exponential backoff until the new file appears. A SIGHUP or
  restart MUST NOT be required.
- **NFR-4.2** The aggregator MUST NOT block frontend queries during backfill. Its gRPC server MUST start listening first; backfill
  (FR-4.3) MUST run in a background goroutine.
- **NFR-4.3** A collector restart MUST NOT affect peer collectors or the aggregator's ability to continue serving the surviving
  collectors' data. When the restarted collector reconnects, its stream MUST resume without operator action.
- **NFR-4.4** An aggregator restart MUST recover its ring-buffer contents from all collectors via `DumpSnapshots`; live streaming
  MUST resume in parallel with backfill so that no minute is lost even during a restart.

**NFR-5 Observability of the system itself**

- **NFR-5.1** The collector MUST expose operator-facing log lines on stdout covering: file discovery, logrotate reopen events, UDP
  listener bind, subscriber connect/disconnect, and fatal configuration errors. The collector MUST NOT log anything on the per-request
  hot path.
- **NFR-5.2** The aggregator MUST log each collector's connect, disconnect, degraded transition, and recovery. Backfill MUST log a
  per-collector line with bucket counts, entry counts, and wall-clock duration.
- **NFR-5.3** The Prometheus exporter MUST be the primary out-of-band health signal. The UDP counters (FR-8.6) plus the per-host
  request counter (FR-8.2) give an operator a full view of ingest health without needing to read the logs.

**NFR-6 Security**

- **NFR-6.1** gRPC traffic MUST be cleartext HTTP/2. Operators who expose the endpoints beyond a trusted network are expected to
  terminate TLS in a front proxy.
- **NFR-6.2** The collector MUST bind its UDP listener to `127.0.0.1` by default (configurable via `-logtail-bind`), so that merely
  setting `-logtail-port` does not expose the socket to the public Internet.
- **NFR-6.3** The system MUST NOT record per-request personally-identifying data beyond what nginx already logs. Client IPs are
  truncated at ingest (FR-1.3); URIs lose their query strings (FR-1.4).

**NFR-7 Documentation and packaging**

- **NFR-7.1** The repository MUST ship `docs/user-guide.md` that walks an operator through nginx log format configuration, running
  each of the four binaries (flags, systemd examples, Docker Compose), and integrating the Prometheus exporter. It MUST contain
  enough examples that a new operator can stand up a single-host deployment end-to-end without reading the source.
- **NFR-7.2** The repository MUST ship `docs/design.md` (this document) covering the normative requirements and the architectural
  rationale.
- **NFR-7.3** All four binaries MUST build as static Go binaries with `CGO_ENABLED=0 -trimpath -ldflags="-s -w"` and MUST ship
  together in a single `scratch`-based Docker image. No OS, no shell, no runtime dependencies.

## Architecture Overview

### Process Model

The project ships four binaries:

- **`collector`**: runs on every nginx host. Ingests logs from files and/or UDP, maintains the live map and tiered rings, serves
  `LogtailService` on port 9090, and exposes Prometheus on port 9100.
- **`aggregator`**: runs centrally. Subscribes to every collector, merges snapshots, serves the same `LogtailService` on port 9091.
- **`frontend`**: runs centrally, alongside the aggregator. HTTP server on port 8080, rendering HTML against the aggregator (or any
  other `LogtailService` endpoint).
- **`cli`**: runs wherever the operator is. Talks to any `LogtailService`. No daemon.

Because all four binaries speak one service, the aggregator is optional for a single-host deployment: the frontend and CLI can point
directly at a collector.

### Data Flow
|
||
|
||
```
|
||
┌──────────────┐ files ┌───────────────┐
|
||
nginx ──▶ │ access.log │───────▶│ file tailer │
|
||
│ (file mode) │ │ (fsnotify) │──┐
|
||
└──────────────┘ └───────────────┘ │
|
||
│
|
||
┌──────────────┐ UDP ┌───────────────┐ │
|
||
nginx-ipng ▶ ipng_stats_ ├───────▶│ udp listener │──┼──▶ LogRecord ──▶ ┌──────────┐
|
||
-stats- │ logtail │ │ (127.0.0.1) │ │ channel (200K)│ store │
|
||
plugin └──────────────┘ └───────────────┘ │ │ goroutine│
|
||
│ └─────┬────┘
|
||
▼ │
|
||
Prom exporter │
|
||
▼
|
||
┌─────────────┐
|
||
│ live map │
|
||
│ (≤100 K) │
|
||
└──────┬──────┘
|
||
│ every 1 m
|
||
▼
|
||
┌─────────────┐
|
||
│ fine ring │
|
||
│ 60×50 K │────┐
|
||
└──────┬──────┘ │
|
||
│ every 5 m │
|
||
▼ │
|
||
┌─────────────┐ │
|
||
│ coarse ring │ │
|
||
│ 288×5 K │ │
|
||
└─────────────┘ │
|
||
│
|
||
┌──────────────────────────────────────┘
|
||
│ StreamSnapshots (push)
|
||
▼
|
||
aggregator ──▶ merged cache ──▶ frontend / CLI
|
||
```
|
||
|
||
Requests enter nginx. The nginx writes either to a log file (file mode) or via the `ipng_stats_logtail` directive to a UDP socket
|
||
(UDP mode), or both. The collector has two ingest goroutines that parse a line into a `LogRecord` and enqueue it on a shared 200 K
|
||
channel. A single store goroutine consumes the channel, updating the live map and maintaining the tiered rings. A once-per-minute
|
||
timer rotates the live map into the fine ring and (every fifth tick) into the coarse ring, and fans the fresh snapshot out to every
|
||
`StreamSnapshots` subscriber. The aggregator is one such subscriber.
|
||
|
||
Query RPCs (`TopN`, `Trend`) MUST read only from the rings and MUST NOT read from the live map. The aggregator's cache is itself a
ring built from the merged-view snapshots; it is updated on the same 1-minute cadence regardless of how many collectors are
connected.

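The rotation cadence can be illustrated with two small ring buffers. The sketch below assumes that a coarse bucket is the sum of the five most recent fine buckets, and it leaves out the per-bucket top-K truncation (50 K and 5 K entries); the `ring` and `store` types and their methods are hypothetical names for the example.

```go
package main

import "fmt"

// Snapshot maps an encoded aggregation key to its request count.
type Snapshot map[string]int64

// ring is a fixed-size circular buffer; pushing overwrites the oldest slot.
type ring struct {
	slots []Snapshot
	next  int // index of the slot the next push writes to
	n     int // number of filled slots
}

func newRing(size int) *ring { return &ring{slots: make([]Snapshot, size)} }

func (r *ring) push(s Snapshot) {
	r.slots[r.next] = s
	r.next = (r.next + 1) % len(r.slots)
	if r.n < len(r.slots) {
		r.n++
	}
}

// last returns up to k of the most recent snapshots, newest first.
func (r *ring) last(k int) []Snapshot {
	if k > r.n {
		k = r.n
	}
	out := make([]Snapshot, 0, k)
	for i := 1; i <= k; i++ {
		out = append(out, r.slots[(r.next-i+len(r.slots))%len(r.slots)])
	}
	return out
}

type store struct {
	live   Snapshot
	fine   *ring
	coarse *ring
	ticks  int
}

// rotate is what the once-per-minute timer would call.
func (s *store) rotate() {
	s.fine.push(s.live)
	s.live = Snapshot{}
	s.ticks++
	if s.ticks%5 == 0 { // every fifth tick feeds the coarse ring
		merged := Snapshot{}
		for _, snap := range s.fine.last(5) {
			for k, v := range snap {
				merged[k] += v
			}
		}
		s.coarse.push(merged)
	}
}

func main() {
	s := &store{live: Snapshot{}, fine: newRing(60), coarse: newRing(288)}
	for minute := 0; minute < 5; minute++ {
		s.live["example.org"] += 10
		s.rotate()
	}
	fmt.Println(s.fine.n, s.coarse.n, s.coarse.last(1)[0]["example.org"]) // prints "5 1 50"
}
```

Because queries read only `fine` and `coarse`, they never observe a half-filled minute; the live map is visible to readers only after `rotate` has frozen it into a ring slot.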
## Components

### Program 1 — Collector (`cmd/collector`)

#### Responsibilities

- Tail on-disk log files via a single `fsnotify.Watcher`, handle logrotate, and re-scan glob patterns periodically to pick up new
  files (FR-2.1, NFR-4.1).
- Listen on an optional UDP socket for `ipng_stats_logtail` datagrams (FR-2.2).
- Parse each log line into a `LogRecord` (FR-1).
- Maintain the live map, fine ring, coarse ring, and subscriber fan-out under a single-writer goroutine (FR-3, NFR-1).
- Serve `LogtailService` on `--listen` (FR-5).
- Expose Prometheus metrics on `--prom-listen` (FR-8).

#### Key data types

- `LogRecord` — fourteen fields (website, client_prefix, URI, status, is_tor, asn, method, bytes_sent, request_length,
  request_time, upstream_response_time, upstream_status, has_upstream, source_tag). Produced by `ParseLine` (which
  dispatches on the `v<N>\t` prefix) and consumed by the store goroutine. v1 records leave the v2-only fields
  (`request_length`, `upstream_*`) at zero / false.
- `Tuple6` (historical name; carries seven fields now) — the aggregation key. NUL-separated when encoded as a map key for snapshots.
  The code name is intentionally stable so downstream tests and consumers are not churned.
- `Snapshot` — `(timestamp, []Entry)` where `Entry = (label, count)` and `label` is an encoded `Tuple6`.

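Dispatching on the `v<N>\t` prefix could look roughly like the sketch below. It is not the project's `ParseLine`: the helper names are invented for illustration, and the per-version field counts in `fieldCount` are placeholders, since the real counts come from the log-format definition.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// splitVersion separates a "v<N>\t" prefix from the rest of the line.
func splitVersion(line string) (int, string, bool) {
	prefix, rest, found := strings.Cut(line, "\t")
	if !found || !strings.HasPrefix(prefix, "v") {
		return 0, "", false
	}
	n, err := strconv.ParseUint(prefix[1:], 10, 16)
	if err != nil || n == 0 {
		return 0, "", false
	}
	return int(n), rest, true
}

// fieldCount holds HYPOTHETICAL per-version field counts, for illustration only.
var fieldCount = map[int]int{1: 3, 2: 5}

// parseLine (hypothetical) validates the prefix and field count, then returns
// the raw tab-separated fields together with the format version.
func parseLine(line string) ([]string, int, error) {
	v, rest, ok := splitVersion(line)
	if !ok {
		return nil, 0, errors.New("missing version prefix")
	}
	want, known := fieldCount[v]
	if !known {
		return nil, 0, fmt.Errorf("unsupported version v%d", v)
	}
	fields := strings.Split(rest, "\t")
	if len(fields) != want {
		return nil, 0, fmt.Errorf("v%d: got %d fields, want %d", v, len(fields), want)
	}
	return fields, v, nil
}

func main() {
	for _, line := range []string{"v1\ta\tb\tc", "v2\ta\tb", "GET / HTTP/1.1"} {
		fields, v, err := parseLine(line)
		if err != nil {
			fmt.Println("dropped:", err)
			continue
		}
		fmt.Printf("v%d record with %d fields\n", v, len(fields))
	}
}
```

A version-specific decoder would then fill the `LogRecord`, leaving the v2-only fields at their zero values for v1 input.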
#### Presents

- `LogtailService` on TCP (default `:9090`).
- A Prometheus `/metrics` handler on TCP (default `:9100`).

#### Consumes

- One or more on-disk log files matched by `--logs` and/or `--logs-file` globs.
- Optionally, a UDP socket on `--logtail-bind:--logtail-port` (default `127.0.0.1`, disabled when port is `0`).

### Program 2 — Aggregator (`cmd/aggregator`)

#### Responsibilities

- Dial every configured collector and subscribe via `StreamSnapshots` (FR-4.2).
- Merge incoming snapshots into a single cache using delta-based subtraction, so a collector's contribution is updated in place
  rather than accumulated (FR-4.2).
- At startup, call `DumpSnapshots` on each collector once, merge the per-timestamp entries, and load the result into the cache
  atomically (FR-4.3).
- Handle collector outages with exponential-backoff reconnect and degraded-collector zeroing (FR-4.4).
- Serve the same `LogtailService` as the collector (FR-5).
- Maintain a `TargetRegistry` that maps collector addresses to display names (updated from the `source` field of incoming
  snapshots).

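Delta-based merging and degraded-collector zeroing can both be expressed as "subtract the collector's previous snapshot, then add the new one". The sketch below is one way to do that under simplified assumptions (a flat key-to-count map, no locking, no rings); the `merger` type and its methods are hypothetical names.

```go
package main

import "fmt"

// Snapshot maps an encoded aggregation key to its request count.
type Snapshot map[string]int64

// merger keeps one merged view plus the last snapshot seen from each collector.
type merger struct {
	merged Snapshot
	last   map[string]Snapshot
}

func newMerger() *merger {
	return &merger{merged: Snapshot{}, last: map[string]Snapshot{}}
}

// apply replaces a collector's contribution in place: the previous snapshot is
// subtracted from the merged view before the new one is added.
func (m *merger) apply(collector string, snap Snapshot) {
	for k, v := range m.last[collector] {
		m.merged[k] -= v
		if m.merged[k] == 0 {
			delete(m.merged, k)
		}
	}
	for k, v := range snap {
		m.merged[k] += v
	}
	m.last[collector] = snap
}

// zero removes a degraded collector's contribution entirely.
func (m *merger) zero(collector string) { m.apply(collector, nil) }

func main() {
	m := newMerger()
	m.apply("nginx1", Snapshot{"example.org": 3})
	m.apply("nginx2", Snapshot{"example.org": 2, "example.net": 1})
	m.apply("nginx1", Snapshot{"example.org": 5}) // replaces 3, does not add to it
	fmt.Println(m.merged)                         // map[example.net:1 example.org:7]
	m.zero("nginx2")
	fmt.Println(m.merged) // map[example.org:5]
}
```

Keeping the last snapshot per collector is what makes recovery cheap: the next snapshot after an outage is applied against an empty previous contribution.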
#### Presents

- `LogtailService` on TCP (default `:9091`).

#### Consumes

- The `StreamSnapshots` and `DumpSnapshots` RPCs on every configured collector (`--collectors`).

### Program 3 — Frontend (`cmd/frontend`)

#### Responsibilities

- Render the drilldown dashboard server-side with no JavaScript (FR-6.1).
- Parse the URL query string into filter / group-by / window state (FR-6.2).
- Issue `TopN`, `Trend`, and `ListTargets` concurrently with a 5 s deadline (FR-6.4).
- Render inline SVG sparklines from `TrendResponse` (FR-6.1).
- Support the mini filter-expression language (FR-6.6) and the `raw=1` JSON output (FR-6.5).
- Expose a source-picker row populated from `ListTargets`.

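Issuing several RPCs concurrently under one shared deadline is a standard `context` pattern. The sketch below uses stub functions in place of the real `TopN`, `Trend`, and `ListTargets` calls, and a 50 ms timeout instead of 5 s so that it finishes quickly; `fanOut`, `fastRPC`, `slowRPC`, and `runDemo` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type rpc func(ctx context.Context) error

// fanOut runs every call concurrently under one shared deadline and returns
// the per-call errors in argument order.
func fanOut(parent context.Context, timeout time.Duration, calls ...rpc) []error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	errs := make([]error, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c rpc) {
			defer wg.Done()
			errs[i] = c(ctx) // each goroutine writes only its own slot
		}(i, c)
	}
	wg.Wait()
	return errs
}

// fastRPC and slowRPC are stand-ins for real gRPC calls.
func fastRPC(ctx context.Context) error { return nil }

func slowRPC(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

const demoTimeout = 50 * time.Millisecond // the design uses 5 s

func runDemo() []error {
	return fanOut(context.Background(), demoTimeout, fastRPC, slowRPC)
}

func main() {
	errs := runDemo()
	fmt.Println(errs[0] == nil, errors.Is(errs[1], context.DeadlineExceeded)) // true true
}
```

With a shared deadline the page render is bounded by the slowest call or the timeout, whichever comes first, rather than by the sum of the three calls.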
#### Presents

- An HTTP dashboard on TCP (default `:8080`).

#### Consumes

- Any `LogtailService` endpoint (`--target`, default `localhost:9091` — the aggregator).

### Program 4 — CLI (`cmd/cli`)

#### Responsibilities

- Dispatch to `topn`, `trend`, `stream`, or `targets` (FR-7.1).
- Parse shared and per-subcommand flags, build a `Filter` proto from them, and fan out to every `--target` concurrently (FR-7.2).
- Print human-readable tables by default; switch to JSON with `--json` (FR-7.2).
- Reconnect automatically in `stream` mode (FR-7.3).

#### Presents

- Exit status `0` on success, non-zero on RPC error (except `stream`, which runs until interrupted).

#### Consumes

- Any `LogtailService` endpoint.

### Protobuf service (`proto/logtail.proto`)

One proto file defines every shared type: `Tuple6` is encoded as a NUL-separated label string inside `TopNEntry`, and the
`Snapshot` message carries both fine (1-min) and coarse (5-min) ring contents. `GroupBy` and `Window` are enums; `Filter` carries
optional exact-match fields, regex fields, and the `StatusOp` comparison enum used for both `http_response` and `asn_number`.

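The NUL-separated label encoding amounts to a join and a split on `"\x00"`. In the sketch below the seven example values and their order are assumptions for illustration, as are the `encodeLabel` and `decodeLabel` helper names.

```go
package main

import (
	"fmt"
	"strings"
)

const sep = "\x00" // NUL cannot appear in the text fields being joined

// encodeLabel joins the key fields into one string usable as a map key.
func encodeLabel(fields []string) string { return strings.Join(fields, sep) }

// decodeLabel splits a label back into its fields.
func decodeLabel(label string) []string { return strings.Split(label, sep) }

func main() {
	// Hypothetical field values and ordering, for illustration only.
	key := []string{"example.org", "192.0.2.0/24", "/index.html", "200", "0", "64496", "GET"}
	label := encodeLabel(key)
	fmt.Printf("%q\n", label)
	fmt.Println(len(decodeLabel(label)), "fields")
}
```

Empty fields survive the round trip as empty strings, so the field count stays fixed; a field that itself contained a NUL byte would not, which is why the separator has to be a byte that cannot occur in the input.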
## Operational Concerns

### Deployment Topology

A typical deployment is:

- **Per nginx host:** one `collector` systemd unit, pointed at `/var/log/nginx/*.log` and/or listening on `127.0.0.1:9514` for the
  `nginx-ipng-stats-plugin` UDP stream. Exposes `:9090` (gRPC) and `:9100` (Prometheus).
- **Central:** one `aggregator` systemd unit on e.g. `agg:9091`, subscribed to all collectors; and one `frontend` systemd unit on
  `agg:8080`, pointed at the aggregator. Operators reach the dashboard via `http://agg:8080/`. Alternatively, the Docker Compose
  file in the repo root runs the aggregator and frontend together.
- **Operator laptop:** `logtail-cli` invocations, pointed at the aggregator for fleet-wide questions or at a specific collector for
  a single-host drilldown.

### Configuration

All four binaries are configured via flags with matching environment variables. The canonical reference is `docs/user-guide.md`.
Representative settings:

- `collector`: `--logs /var/log/nginx/*.log`, `--logtail-port 9514`, `--source $(hostname)`, `--prom-listen :9100`.
- `aggregator`: `--collectors nginx1:9090,nginx2:9090`, `--listen :9091`.
- `frontend`: `--target agg:9091`, `--listen :8080`.
- `cli`: no persistent configuration; every invocation carries `--target`.

### Reload and Restart Semantics

- **Collector restart.** The live map and both rings start empty. The file tailer resumes at EOF of each watched file (no historical
  replay). The fine ring refills within an hour; the coarse ring within 24 hours.
- **Aggregator restart.** Backfill reconstructs the cache from all collectors' `DumpSnapshots` streams. The gRPC server is listening
  before backfill begins (NFR-4.2), so the frontend is never blocked during restart — it just sees an incomplete cache for the few
  seconds backfill takes.
- **Collector outage.** The aggregator reconnects with backoff; after three consecutive failures the collector's contribution is
  zeroed (FR-4.4) so the merged view does not show stale counts. On recovery the zeroing is reversed by the next snapshot.
- **nginx logrotate.** The collector drains the old fd, closes, and retries the original path. No operator action (NFR-4.1).
- **nginx-ipng-stats-plugin reload.** The plugin's UDP socket is per-worker; a reload simply causes new workers to open fresh
  sockets to the same address. The collector sees a brief gap and resumes.

### Observability of the System Itself

The primary channel is the collector's Prometheus endpoint (FR-8). Beyond the per-host request counter and the per-source roll-ups,
three UDP counters give direct visibility into the UDP ingest path:

- `logtail_udp_packets_received_total` — what arrived.
- `logtail_udp_loglines_success_total` — log lines that parsed cleanly (one datagram may contribute many).
- `logtail_udp_loglines_consumed_total` — log lines that made it to the store (i.e. were not dropped by a full channel).

`received - success` is the parse-failure rate; `success - consumed` is the back-pressure drop rate. Operators should alert on both
being non-zero.

Each binary logs human-readable lines on stdout for connect/disconnect events, logrotate reopen, backfill timing, and degraded
transitions. There is no per-request logging.

### Failure Modes

- **High-cardinality DDoS.** The live map hits 100 000 entries and stops accepting new keys until the next rotation (NFR-2.1).
  Existing top-K entries keep accumulating, so the attacker's dominant prefixes / URIs remain visible. The cap resets every minute.
- **Collector crash.** In-flight live-map state for the current minute is lost. The next collector start resumes tailing; the
  aggregator zeroes the degraded collector's contribution after a few seconds and reintegrates it when snapshots resume.
- **Aggregator crash.** No collector is affected. The operator restarts the aggregator; backfill reconstructs the cache.
- **Frontend crash.** Stateless. Operator restarts.
- **UDP datagram loss.** Any datagram dropped in-kernel (socket buffer full, network drop) does not register as a parse failure; it
  is simply invisible. Operators should size `SO_RCVBUF` appropriately; the collector already requests 4 MiB.
- **Malformed log lines.** Both ingest paths use the versioned `v<N>\t` parser (FR-2). Lines without a recognised version
  prefix, with the wrong field count for the claimed version, or with a bad IP are silently dropped. UDP drops are
  visible as `packets_received_total - loglines_success_total`; file-path drops are implicit (the tailer simply moves
  past them).
- **Clock skew between collectors.** Trend sparklines derived from merged data assume collectors are roughly NTP-synced. Per-bucket
  alignment is to the local minute / 5-minute boundary of each collector.
- **gRPC traffic over untrusted links.** The system does not ship TLS; operators should front the gRPC ports with a TLS-terminating
  proxy or an IPsec tunnel.

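The live-map cap from the first bullet can be sketched as a map that refuses new keys once it is full while still incrementing keys it already holds. The `liveMap` type and its methods are hypothetical names, and the limit of 2 in `main` stands in for the real 100 000.

```go
package main

import "fmt"

// liveMap counts requests per key but refuses new keys once limit is reached.
type liveMap struct {
	counts map[string]int64
	limit  int
}

func newLiveMap(limit int) *liveMap {
	return &liveMap{counts: make(map[string]int64), limit: limit}
}

// add increments key and reports whether the request was counted.
// Existing keys always keep accumulating; only unseen keys are rejected.
func (m *liveMap) add(key string) bool {
	if _, seen := m.counts[key]; !seen && len(m.counts) >= m.limit {
		return false
	}
	m.counts[key]++
	return true
}

// rotate hands back the current contents and starts an empty map,
// which also resets the cap.
func (m *liveMap) rotate() map[string]int64 {
	old := m.counts
	m.counts = make(map[string]int64)
	return old
}

func main() {
	m := newLiveMap(2) // the design uses 100 000
	for _, k := range []string{"a", "b", "c", "a"} {
		fmt.Println(k, m.add(k))
	}
	fmt.Println(m.rotate()) // map[a:2 b:1]
	fmt.Println(m.add("c")) // true: the cap reset with the rotation
}
```

The trade-off is that keys first seen after the cap is hit are invisible for the rest of that minute, which is acceptable when the goal is ranking the dominant talkers rather than exact accounting.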
### Security

- **No TLS, no auth.** Deliberate (NFR-6.1). Deploy on a trusted network or behind a TLS proxy.
- **UDP bind.** Default `127.0.0.1` so merely turning on the listener does not expose a public socket (NFR-6.2).
- **Client-IP truncation.** Client addresses are truncated at ingest; the system never stores full client IPs (NFR-6.3, FR-1.3).
- **Query-string stripping.** URIs lose their query strings at ingest. A user who cares about `?q=` parameters must re-engineer
  nginx's log format — and then accept the resulting cardinality.

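Both ingest-time reductions are short with the standard library. In the sketch below the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions chosen for illustration, since this section does not state the actual values, and `truncateClient` and `stripQuery` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// ASSUMED prefix lengths; the real values are defined by FR-1.3.
const (
	v4Bits = 24
	v6Bits = 48
)

// truncateClient reduces a client address to its covering prefix so that the
// full IP is never stored.
func truncateClient(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	bits := v6Bits
	if addr.Is4() {
		bits = v4Bits
	}
	prefix, err := addr.Prefix(bits)
	if err != nil {
		return "", err
	}
	return prefix.String(), nil
}

// stripQuery drops everything from the first '?' onwards.
func stripQuery(uri string) string {
	path, _, _ := strings.Cut(uri, "?")
	return path
}

func main() {
	for _, ip := range []string{"192.0.2.77", "2001:db8:1234:5678::1"} {
		p, err := truncateClient(ip)
		fmt.Println(p, err)
	}
	fmt.Println(stripQuery("/search?q=secret")) // /search
}
```

Besides the privacy benefit, both steps shrink the key space: many clients collapse into one prefix and many URLs into one path.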
## Alternatives Considered

- **Log shipping to ClickHouse / ELK.** Rejected as the default: adds a storage tier to a problem that fits in a per-host 1 GB
  ring, for the target fleet size. A future ClickHouse export from the aggregator is viable and would be additive (deferred).
- **Raw request logging to Kafka.** Rejected: preserves every request at much higher cost for no visibility benefit; the operator
  wants top-K ranking, not a replay log. If raw logging is desired, nginx's own access log is the right tool.
- **Promtail / Grafana Loki.** Rejected as the primary interface. Loki is excellent for free-text log search but weak for fast
  ranked aggregations over dozens of dimensions; the drilldown interaction the operator wants fits poorly into LogQL.
- **In-process Lua aggregator on each nginx.** Considered for the collector tier. Rejected: shipping counters to a central view
  still requires a process outside nginx; keeping the ingest path out of the nginx worker avoids a class of latency regressions.
- **Pull-based collector polling (aggregator polls collectors every second).** Rejected in favor of push. Polling multiplies query
  latency and makes the aggregator's cache stale by the poll interval. Push-stream with delta merge keeps the cache within seconds
  of real time.
- **Separate `_by_source` metric names with a single label.** The original v0.2 layout exposed `_by_source` siblings to
  avoid mixing label sets under one metric name. Superseded by the v0.3 layout: histograms now carry both `host` and
  `source_tag` directly, and the source-tag rollup counter gains a `status_class` label. Cardinality stays bounded
  (~7 hosts × ~6 tags × 11 buckets ≈ 460 series per histogram), and Grafana queries become simpler (`sum by(source_tag)`
  rather than picking a different metric name).
- **Writing every `snapshot` to disk for restart recovery.** Rejected in favor of `DumpSnapshots` RPC backfill. Disk-backed
  persistence would multiply operational surface (rotation, fsck, permissions) for a feature that needs to survive only an
  aggregator restart.

## Decisions Deferred Post-v0.2

- **ClickHouse export from aggregator.** 1-minute pre-aggregated rows pushed into a `SummingMergeTree` table for 7-day / 30-day
  windows. Frontend would route longer windows to ClickHouse while shorter windows stay on the in-memory rings. Strictly additive;
  no interface changes. Deferred until a concrete retention requirement lands.
- **TLS on gRPC endpoints.** The argument for shipping TLS changes if/when the aggregator is deployed across an untrusted network
  segment. Until then, a front proxy is the right shape.
- **Ring-buffer sizing on a per-collector basis.** Today every collector ships the same 60×50 K / 288×5 K dimensions. A
  low-traffic collector can afford smaller rings; a hot one might want larger. Deferred — the uniform default is operationally
  simpler.
- **Authenticated Prometheus scraping.** The endpoint is currently open on `:9100`. If a future deployment puts the scraper on a
  less-trusted path, scrape-side auth (bearer token, TLS client cert) is the right add-on.
- **Coarse tier beyond 24 h.** Extending to 7 days in-memory would cost ~70 MB per collector but add 2016 buckets to iterate on a
  `W24H+` query. Deferred until the operator wants a 7-day drilldown without ClickHouse.