RELEASE 1.0.1: v2 log format, source_tag-labeled metrics, lint cleanup

Wire-format and metric overhaul. Both file and UDP ingest now share one
versioned ParseLine that dispatches on the v<N>\t prefix; v1 stays
unchanged, v2 adds $bytes_sent (replacing $body_bytes_sent),
$request_length, $upstream_response_time, and $upstream_status. File
ingest gains the same versioning, and the legacy positional file format
is removed (no live deployments).

Prometheus exposition is rewritten:
- nginx_http_bytes_sent and nginx_http_request_duration_seconds gain
  a source_tag label.
- nginx_http_requests_by_source_total gains status_class.
- New v2-only metrics: nginx_http_request_bytes,
  nginx_http_upstream_duration_seconds,
  nginx_http_upstream_requests_total{status_class}.
- Dropped nginx_http_response_body_bytes_by_source (subsumed by the
  dual-labeled bytes_sent metric).

Adds 'make fixstyle' (gofmt -w) and clears all golangci-lint findings
across the repo (errcheck, S1001, ST1005, unused).

Docs in design.md FR-2/FR-8 and user-guide.md are rewritten to present
v2 as the recommended log format.
+108
-56
@@ -64,8 +64,9 @@ natively, so operators can run it either from on-disk log files, from the UDP fe

 ### Non-Goals

-- The system does **not** parse arbitrary nginx `log_format` strings. Two fixed tab-separated formats are supported: a file format and
-a UDP format (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or Promtail.
+- The system does **not** parse arbitrary nginx `log_format` strings. A single versioned tab-separated format is
+supported on both file and UDP ingest (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or
+Promtail.
 - The system does **not** store raw log lines. Counts are aggregated at ingest; the original log lines are not kept in memory or on
 disk. The project does not replace an access log.
 - The system does **not** persist counters across restarts. Ring buffers are in-memory only. On aggregator restart, historical state
@@ -98,50 +99,92 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
 working set.
 - **FR-1.5** `http_response` MUST be the HTTP status code as recorded by nginx.
 - **FR-1.6** `is_tor` MUST be a boolean, populated by the operator in the log format (typically via a lookup against a TOR exit-node
-list). For the file format, lines without this field default to `false` for backward compatibility.
-- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). For the file format, lines without
-this field default to `0`.
-- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. For records from
-on-disk log files, the collector MUST assign the tag `"direct"` (mirroring `nginx-ipng-stats-plugin`'s default-source convention). For
-records from the UDP stream, the tag is taken from the log line as emitted by the plugin.
+list). Operators without TOR data MUST emit literal `0`.
+- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). Operators without GeoIP data MUST
+emit literal `0`.
+- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. The tag is
+always taken verbatim from the log line; the collector does NOT synthesise a fallback. Operators not running
+`nginx-ipng-stats-plugin` MUST emit a literal value (typically `"direct"`).

 **FR-2 Log formats**

-- **FR-2.1 File format.** The collector MUST accept nginx access logs in the following tab-separated layout, with the last two fields
-(`is_tor`, `asn`) optional for backward compatibility:
+- **FR-2.1 Versioned dispatch.** Both the file tailer and the UDP listener MUST funnel every input line through a single
+parser that switches on a leading `v<N>\t` version tag. Lines without a recognised tag — including the legacy
+positional file format — MUST be rejected and counted as parse failures. Two versions are defined: `v1` (FR-2.2) and
+`v2` (FR-2.3). Both ingest paths accept both versions; downstream processing is identical regardless of which path the
+line came in over. `$server_addr` and `$scheme` are parsed but discarded — they are reserved for future use.
+
+- **FR-2.2 v1 format.** The v1 payload MUST be exactly 12 tab-separated fields after the `v1` tag (13 fields total).

 ```nginx
-log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
+log_format ipng_stats_logtail
+'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
+'$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
 ```

-| # | Field | Ingested into |
-|---|-------------------|----------------------------|
-| 0 | `$host` | `website` |
-| 1 | `$remote_addr` | `client_prefix` (truncated)|
-| 2 | `$msec` | (discarded) |
-| 3 | `$request_method` | Prom `method` label |
-| 4 | `$request_uri` | `http_request_uri` |
-| 5 | `$status` | `http_response` |
-| 6 | `$body_bytes_sent`| Prom body histogram |
-| 7 | `$request_time` | Prom duration histogram |
-| 8 | `$is_tor` | `is_tor` (optional) |
-| 9 | `$asn` | `asn` (optional) |
+| # | Field | Ingested into |
+|---|-------------------|-------------------------------------|
+| 0 | `v1` | version tag |
+| 1 | `$host` | `website` |
+| 2 | `$remote_addr` | `client_prefix` (truncated) |
+| 3 | `$request_method` | Prom `method` label |
+| 4 | `$request_uri` | `http_request_uri` (query stripped) |
+| 5 | `$status` | `http_response` |
+| 6 | `$body_bytes_sent`| Prom `nginx_http_bytes_sent` |
+| 7 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
+| 8 | `$is_tor` | `is_tor` |
+| 9 | `$asn` | `asn` |
+| 10| `$ipng_source_tag`| `source_tag` |
+| 11| `$server_addr` | *(parsed and discarded)* |
+| 12| `$scheme` | *(parsed and discarded)* |

-- **FR-2.2 UDP format.** The collector MUST accept datagrams in a versioned tab-separated layout, as emitted by
-`nginx-ipng-stats-plugin`'s `ipng_stats_logtail` directive. Every datagram MUST begin with a literal version tag
-(`v<N>\t`) so the collector can route each packet to the appropriate parser. Only `v1` is defined in this revision;
-unknown versions MUST be counted as parse failures and dropped.
+- **FR-2.3 v2 format.** The v2 payload MUST be exactly 15 tab-separated fields after the `v2` tag (16 fields total).
+v2 replaces `$body_bytes_sent` with `$bytes_sent` (full wire bytes including headers) and adds three operationally
+important fields: `$request_length` (request size including headers), `$upstream_response_time`, and `$upstream_status`.
+The remaining v1 fields keep their relative order.

 ```nginx
-log_format ipng_stats_logtail 'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
+log_format ipng_stats_logtail
+'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
+'$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
+'$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
 ```

-The v1 payload MUST have exactly 12 tab-separated fields after the `v1` tag (13 fields total). `$server_addr` and
-`$scheme` MUST be parsed but dropped; they are reserved for future use. Malformed datagrams (wrong version, wrong
-field count, bad IP) MUST be counted (FR-8.5) and silently dropped.
+| # | Field | Ingested into |
+|---|---------------------------|----------------------------------------------|
+| 0 | `v2` | version tag |
+| 1 | `$host` | `website` |
+| 2 | `$remote_addr` | `client_prefix` (truncated) |
+| 3 | `$request_method` | Prom `method` label |
+| 4 | `$request_uri` | `http_request_uri` (query stripped) |
+| 5 | `$status` | `http_response` |
+| 6 | `$bytes_sent` | Prom `nginx_http_bytes_sent` |
+| 7 | `$request_length` | Prom `nginx_http_request_bytes` (v2-only) |
+| 8 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
+| 9 | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` (v2-only) |
+| 10| `$upstream_status` | Prom `nginx_http_upstream_requests_total` (v2-only) |
+| 11| `$is_tor` | `is_tor` |
+| 12| `$asn` | `asn` |
+| 13| `$ipng_source_tag` | `source_tag` |
+| 14| `$server_addr` | *(parsed and discarded)* |
+| 15| `$scheme` | *(parsed and discarded)* |

-- **FR-2.3** The file tailer MUST set `source_tag="direct"` on every record it parses. The UDP listener MUST propagate
-`$ipng_source_tag` verbatim. This is the only difference in downstream processing between the two ingest paths.
+When nginx serves the response without an upstream (static files, redirects, errors), nginx emits literal `-` for
+`$upstream_response_time` and `$upstream_status`. The parser MUST treat that as "no upstream", skip the upstream
+histograms, and not increment the upstream counter. When nginx retries across multiple upstreams, both fields are
+comma-separated; the parser MUST keep the last entry, since that is the upstream that ultimately served the response.
+
+- **FR-2.4 Semantic shift on v2 rollout.** v1 fills `nginx_http_bytes_sent` from `$body_bytes_sent`; v2 fills it from
+`$bytes_sent`. Operators MUST expect a small step up in the metric when emitters move from v1 to v2 (header overhead;
+typically a few hundred bytes per response).
+
+- **FR-2.5 Malformed input.** Lines with an unknown version, the wrong field count for the claimed version, or an
+unparsable IP MUST be silently dropped. UDP drops MUST be counted via FR-8.6; file-path drops are implicit (the tailer
+simply moves past the line).
+
+- **FR-2.6 Unknown `$is_tor` / `$asn`.** Operators without TOR or GeoIP data MUST emit literal `0` for both fields. A
+literal `0` in `$is_tor` parses as `false`; a literal `0` in `$asn` parses as ASN `0`, filterable at query time with
+`--asn '!=0'`.

 **FR-3 Ring buffers and time windows**

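The upstream rules in FR-2.3 (literal `-` means no upstream; comma-separated retry chains keep the last entry) can be sketched in Go as follows. This is a minimal illustration, not the project's parser: the names `lastEntry` and `parseUpstream` are hypothetical, and treating an unparsable value the same as "no upstream" is an assumption the spec does not state.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lastEntry returns the final comma-separated element of an nginx upstream
// variable with surrounding spaces trimmed ("0.500, 0.020" -> "0.020").
func lastEntry(s string) string {
	if i := strings.LastIndexByte(s, ','); i >= 0 {
		s = s[i+1:]
	}
	return strings.TrimSpace(s)
}

// parseUpstream interprets the $upstream_response_time and $upstream_status
// fields of a v2 line. ok is false when nginx reported no upstream ("-").
// Assumption: unparsable values are also reported as ok == false.
func parseUpstream(respTime, status string) (seconds float64, code int, ok bool) {
	t, s := lastEntry(respTime), lastEntry(status)
	if t == "-" || s == "-" {
		return 0, 0, false
	}
	sec, err := strconv.ParseFloat(t, 64)
	if err != nil {
		return 0, 0, false
	}
	c, err := strconv.Atoi(s)
	if err != nil {
		return 0, 0, false
	}
	return sec, c, true
}

func main() {
	inputs := [][2]string{
		{"-", "-"},                   // static file, redirect, or error page
		{"0.012", "200"},             // single upstream
		{"0.500, 0.020", "502, 200"}, // retry: the last upstream served the response
	}
	for _, in := range inputs {
		sec, code, ok := parseUpstream(in[0], in[1])
		fmt.Printf("%q %q -> seconds=%v status=%d hasUpstream=%v\n", in[0], in[1], sec, code, ok)
	}
}
```

A caller would skip both upstream histograms and the upstream counter whenever `ok` is false, which matches the "no upstream" behaviour required above.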
@@ -242,15 +285,21 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
 - **FR-8.2** The collector MUST expose a per-request counter `nginx_http_requests_total{host, method, status}` capped at
 `promCounterCap = 250 000` distinct label sets. When the cap is reached, further new label sets MUST be dropped (existing series
 keep incrementing) until the map is rolled over.
-- **FR-8.3** The collector MUST expose per-host histograms
-`nginx_http_response_body_bytes{host, le}` (body-size distribution) and
-`nginx_http_request_duration_seconds{host, le}` (request-time distribution). The duration histogram MUST NOT be split by
-`source_tag` — its bucket count would multiply without operational benefit.
-- **FR-8.4** The collector MUST expose two parallel roll-ups labeled by `source_tag` only (not cross-producted with host):
-`nginx_http_requests_by_source_total{source_tag}` and
-`nginx_http_response_body_bytes_by_source{source_tag, le}`. These are separate metric names to avoid inconsistent label sets
-under a single name.
-- **FR-8.5** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
+- **FR-8.3** The collector MUST expose two histograms keyed by `{host, source_tag}`:
+`nginx_http_bytes_sent{host, source_tag, le}` (response wire-bytes distribution; FR-2.4) and
+`nginx_http_request_duration_seconds{host, source_tag, le}` (end-to-end request time distribution).
+Cardinality is bounded by `host × source_tag × bucket_count`, which is small enough that no explicit cap is required.
+- **FR-8.4** The collector MUST expose three v2-only metrics that are populated only when v2 records arrive (and, for the
+upstream metrics, only when nginx involved an upstream):
+`nginx_http_request_bytes{host, source_tag, le}` from `$request_length`,
+`nginx_http_upstream_duration_seconds{host, source_tag, le}` from `$upstream_response_time`, and
+`nginx_http_upstream_requests_total{host, source_tag, status_class}` from `$upstream_status`. `status_class` is the
+HTTP class of the upstream's status code, folded to `2xx`/`3xx`/`4xx`/`5xx`/`other`.
+- **FR-8.5** The collector MUST expose a source-tag rollup counter
+`nginx_http_requests_by_source_total{source_tag, status_class}`. `status_class` is the HTTP class of `$status`, folded
+the same way as in FR-8.4. This rollup is intentionally not cross-producted with `host` — its purpose is fleet-wide
+source-attribution health, not per-host detail.
+- **FR-8.6** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
 `logtail_udp_packets_received_total` (datagrams off the socket, one increment per `recvfrom`),
 `logtail_udp_loglines_success_total` (log lines that parsed OK, incremented once per log line — a single batched datagram from
 the nginx plugin may contribute many), and
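The `status_class` folding described in FR-8.4 and FR-8.5 can be illustrated with a few lines of Go. The function name `statusClass` is hypothetical; the only behaviour taken from the text is the five-way fold to `2xx`/`3xx`/`4xx`/`5xx`/`other`, so 1xx codes and out-of-range values are assumed to land in `other`.

```go
package main

import "fmt"

// statusClass folds an HTTP status code into the label values named in
// FR-8.4: 2xx, 3xx, 4xx, 5xx, or "other" for everything else
// (including 1xx and values outside the HTTP range).
func statusClass(code int) string {
	switch {
	case code >= 200 && code <= 299:
		return "2xx"
	case code >= 300 && code <= 399:
		return "3xx"
	case code >= 400 && code <= 499:
		return "4xx"
	case code >= 500 && code <= 599:
		return "5xx"
	default:
		return "other"
	}
}

func main() {
	for _, code := range []int{101, 204, 301, 404, 503, 0} {
		fmt.Printf("%d -> %s\n", code, statusClass(code))
	}
}
```

Folding to five values keeps the `status_class` dimension at a fixed size no matter which codes nginx or an upstream returns.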
@@ -279,13 +328,13 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
 for a collector is therefore approximately 845 MB (live map ~19 MB + fine ring ~558 MB + coarse ring ~268 MB).
 - **NFR-2.3** The aggregator MUST apply the same tier caps as the collector. Its steady-state memory is roughly equivalent to one
 collector regardless of the number of collectors subscribed.
-- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The per-host and per-source
-histograms MUST NOT be capped explicitly — they grow only with the distinct host count, which is bounded by the operator's vhost
-configuration.
+- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The dual-labeled
+`{host, source_tag}` histograms MUST NOT be capped explicitly — they grow only with the cross-product of distinct
+hosts and distinct source tags, both bounded by the operator's nginx configuration.

 **NFR-3 Performance**

-- **NFR-3.1** `ParseLine` and `ParseUDPLine` MUST use `strings.Split` / `strings.SplitN` (no regex), so that per-line cost stays
+- **NFR-3.1** `ParseLine` MUST use `strings.Split` / `strings.IndexByte` (no regex), so that per-line cost stays
 around 50 ns on commodity hardware.
 - **NFR-3.2** `TopN` and `Trend` queries across the full 24-hour coarse ring MUST complete in well under 250 ms at the 50 000-entry
 fine cap, for fully-specified filters.
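To make the regex-free dispatch of NFR-3.1 and FR-2.1 concrete, the sketch below peels off the `v<N>\t` tag with `strings.IndexByte` and splits the remainder with `strings.Split`. It is an illustration under assumptions, not the actual `ParseLine`: the function name `splitVersioned` and its error messages are hypothetical, and only the field counts (12 after `v1`, 15 after `v2`) come from FR-2.2 and FR-2.3. No timing claim is made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitVersioned dispatches on the leading version tag and checks the field
// count for that version. Lines without a recognised tag, including the
// legacy positional file format, are rejected.
func splitVersioned(line string) (version string, fields []string, err error) {
	i := strings.IndexByte(line, '\t')
	if i < 0 {
		return "", nil, errors.New("missing version tag")
	}
	tag := line[:i]
	var want int
	switch tag {
	case "v1":
		want = 12 // FR-2.2
	case "v2":
		want = 15 // FR-2.3
	default:
		return "", nil, fmt.Errorf("unknown version tag %q", tag)
	}
	fields = strings.Split(line[i+1:], "\t")
	if len(fields) != want {
		return "", nil, fmt.Errorf("%s: got %d fields, want %d", tag, len(fields), want)
	}
	return tag, fields, nil
}

func main() {
	v1 := "v1\t" + strings.Repeat("x\t", 11) + "x" // 12 payload fields
	legacy := "example.org\t192.0.2.1\t1700000000.000\tGET"
	for _, line := range []string{v1, legacy} {
		version, fields, err := splitVersioned(line)
		fmt.Println(version, len(fields), err)
	}
}
```

Field-level validation (IP parsing, numeric conversion) would follow on the returned slice; it is left out here to keep the dispatch step visible.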
@@ -417,8 +466,10 @@ connected.

 #### Key data types

-- `LogRecord` — ten fields (website, client_prefix, URI, status, is_tor, asn, method, body_bytes_sent, request_time, source_tag).
-Produced by `ParseLine` or `ParseUDPLine` and consumed by the store goroutine.
+- `LogRecord` — fourteen fields (website, client_prefix, URI, status, is_tor, asn, method, bytes_sent, request_length,
+request_time, upstream_response_time, upstream_status, has_upstream, source_tag). Produced by `ParseLine` (which
+dispatches on the `v<N>\t` prefix) and consumed by the store goroutine. v1 records leave the v2-only fields
+(`request_length`, upstream_*) at zero / false.
 - `Tuple6` (historical name; carries seven fields now) — the aggregation key. NUL-separated when encoded as a map key for snapshots.
 The code name is intentionally stable so downstream tests and consumers are not churned.
 - `Snapshot` — `(timestamp, []Entry)` where `Entry = (label, count)` and `label` is an encoded `Tuple6`.
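One way to picture the fourteen-field `LogRecord` is the struct sketch below. The Go field names and types are assumptions chosen for illustration (only `asn` as int32 is stated in FR-1.7); the real definition may differ. The helper `upstreamSample` is hypothetical and shows why a separate `has_upstream` flag is useful: a zero upstream duration is a legal value, so zero alone cannot signal "no upstream".

```go
package main

import "fmt"

// LogRecord mirrors the fourteen fields listed above. Names and types are
// illustrative assumptions, not the project's actual declaration.
type LogRecord struct {
	Website              string
	ClientPrefix         string
	URI                  string
	Status               int
	IsTor                bool
	ASN                  int32
	Method               string
	BytesSent            int64
	RequestLength        int64   // v2-only; zero for v1 records
	RequestTime          float64 // seconds
	UpstreamResponseTime float64 // v2-only; seconds
	UpstreamStatus       int     // v2-only
	HasUpstream          bool    // v2-only; false for v1 records
	SourceTag            string
}

// upstreamSample returns the upstream duration only when the record actually
// involved an upstream, so callers never observe a placeholder zero.
func upstreamSample(r LogRecord) (seconds float64, ok bool) {
	if !r.HasUpstream {
		return 0, false
	}
	return r.UpstreamResponseTime, true
}

func main() {
	v1 := LogRecord{Website: "example.org", Status: 200, SourceTag: "direct"}
	v2 := LogRecord{Website: "example.org", Status: 200, SourceTag: "direct",
		HasUpstream: true, UpstreamResponseTime: 0.020, UpstreamStatus: 200}
	for _, r := range []LogRecord{v1, v2} {
		sec, ok := upstreamSample(r)
		fmt.Println(sec, ok)
	}
}
```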
@@ -559,9 +610,10 @@ transitions. No per-request logging.
 - **Frontend crash.** Stateless. Operator restarts.
 - **UDP datagram loss.** Any datagram dropped in-kernel (socket buffer full, network drop) does not register as a parse failure; it
 is simply invisible. Operators should size `SO_RCVBUF` appropriately; the collector already requests 4 MiB.
-- **Malformed log lines.** File format: lines with <8 tab-separated fields are silently skipped; an invalid IP also drops the line.
-UDP: packets without a recognised `v<N>\t` prefix, or with the wrong field count for the claimed version, or with a bad IP, are
-counted as received-but-not-success and dropped.
+- **Malformed log lines.** Both ingest paths use the versioned `v<N>\t` parser (FR-2). Lines without a recognised version
+prefix, with the wrong field count for the claimed version, or with a bad IP are silently dropped. UDP drops are
+visible as `packets_received_total - loglines_success_total`; file-path drops are implicit (the tailer simply moves
+past them).
 - **Clock skew between collectors.** Trend sparklines derived from merged data assume collectors are roughly NTP-synced. Per-bucket
 alignment is to the local minute / 5-minute boundary of each collector.
 - **gRPC traffic over untrusted links.** The system does not ship TLS; operators should front the gRPC ports with a TLS-terminating
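The accounting behind the "Malformed log lines" entry can be sketched as below: one increment per datagram received, one per log line that parses, and a silent skip for everything else. All names here (`udpCounters`, `handleDatagram`, `requireVersionTag`) are hypothetical, the stand-in validator only checks the version prefix, and the assumption that a batched datagram separates log lines with `\n` is not confirmed by the text above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// udpCounters stands in for logtail_udp_packets_received_total and
// logtail_udp_loglines_success_total.
type udpCounters struct {
	PacketsReceived uint64
	LoglinesSuccess uint64
}

// requireVersionTag is a placeholder for the real parser: it accepts any
// line that starts with a known version tag.
func requireVersionTag(line string) error {
	if strings.HasPrefix(line, "v1\t") || strings.HasPrefix(line, "v2\t") {
		return nil
	}
	return errors.New("unrecognised version prefix")
}

// handleDatagram counts the packet once, then counts each line that parses.
// Lines that fail to parse are dropped without logging.
// Assumption: lines inside a batched datagram are newline-separated.
func (c *udpCounters) handleDatagram(payload string, parse func(string) error) {
	c.PacketsReceived++
	for _, line := range strings.Split(payload, "\n") {
		if line == "" {
			continue
		}
		if err := parse(line); err != nil {
			continue
		}
		c.LoglinesSuccess++
	}
}

func main() {
	var c udpCounters
	c.handleDatagram("v1\tok\nlegacy positional line\nv2\tok\n", requireVersionTag)
	fmt.Printf("packets=%d loglines_ok=%d\n", c.PacketsReceived, c.LoglinesSuccess)
}
```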
@@ -588,11 +640,11 @@ transitions. No per-request logging.
 - **pull-based collector polling (aggregator polls collectors every second).** Rejected in favor of push. Polling multiplies query
 latency and makes the aggregator's cache stale by the poll interval. Push-stream with delta merge keeps the cache within seconds
 of real time.
-- **One metric name for both per-host and per-source_tag roll-ups.** Rejected for Prometheus hygiene. Mixing different label sets
-under one metric name breaks aggregation rules; separate metric names (`_by_source`) are clearer and easier to query.
-- **Cross-product of `host × source_tag` for every counter and histogram.** Rejected. With ~20 tags and ~50 hosts the cardinality
-explodes quickly on the duration histogram without operational benefit. The duration histogram stays per-host; requests and body
-size get a parallel `_by_source` rollup.
+- **Separate `_by_source` metric names with a single label.** The original v0.2 layout exposed `_by_source` siblings to
+avoid mixing label sets under one metric name. Superseded by the v0.3 layout: histograms now carry both `host` and
+`source_tag` directly, and the source-tag rollup counter gains a `status_class` label. Cardinality stays bounded
+(~7 hosts × ~6 tags × 11 buckets ≈ 460 series per histogram), and Grafana queries become simpler (`sum by(source_tag)`
+rather than picking a different metric name).
 - **Writing every `snapshot` to disk for restart recovery.** Rejected in favor of `DumpSnapshots` RPC backfill. Disk-backed
 persistence would multiply operational surface (rotation, fsck, permissions) for a feature that needs to survive only an
 aggregator restart.
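The cardinality estimate quoted for the dual-labeled histograms is a plain product, which the short Go sketch below reproduces. The function name `histogramSeries` is hypothetical. The second call applies the same 11-bucket assumption to the "~20 tags and ~50 hosts" figures from the removed text, purely for comparison.

```go
package main

import "fmt"

// histogramSeries returns the number of bucket series for one histogram
// labeled by {host, source_tag}: hosts x source tags x buckets.
func histogramSeries(hosts, sourceTags, buckets int) int {
	return hosts * sourceTags * buckets
}

func main() {
	fmt.Println(histogramSeries(7, 6, 11))   // 462, the "≈ 460" quoted above
	fmt.Println(histogramSeries(50, 20, 11)) // 11000, the earlier worst-case sizing
}
```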