RELEASE 1.0.1: v2 log format, source_tag-labeled metrics, lint cleanup

Wire-format and metric overhaul. File and UDP ingest now share one
versioned ParseLine that dispatches on the v<N>\t prefix. v1 stays
unchanged; v2 replaces $body_bytes_sent with $bytes_sent and adds
$request_length, $upstream_response_time, and $upstream_status. The
legacy positional file format is removed (no live deployments).

Prometheus exposition is rewritten:

  - nginx_http_bytes_sent and nginx_http_request_duration_seconds gain
    a source_tag label.
  - nginx_http_requests_by_source_total gains status_class.
  - New v2-only metrics: nginx_http_request_bytes,
    nginx_http_upstream_duration_seconds,
    nginx_http_upstream_requests_total{status_class}.
  - Dropped nginx_http_response_body_bytes_by_source (subsumed by the
    dual-labeled bytes_sent metric).

Adds 'make fixstyle' (gofmt -w) and clears all golangci-lint findings
across the repo (errcheck, S1001, ST1005, unused).

Docs in design.md FR-2/FR-8 and user-guide.md are rewritten to present
v2 as the recommended log format.
2026-05-01 15:40:53 +02:00
parent d1a21a7a62
commit 6647f95be4
28 changed files with 931 additions and 724 deletions
+108 -56
@@ -64,8 +64,9 @@ natively, so operators can run it either from on-disk log files, from the UDP fe
### Non-Goals
- The system does **not** parse arbitrary nginx `log_format` strings. Two fixed tab-separated formats are supported: a file format and
a UDP format (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or Promtail.
- The system does **not** parse arbitrary nginx `log_format` strings. A single versioned tab-separated format is
supported on both file and UDP ingest (see FR-2). Operators who need general parsing should use Vector, Fluent Bit, or
Promtail.
- The system does **not** store raw log lines. Counts are aggregated at ingest; the original log lines are not kept in memory or on
disk. The project does not replace an access log.
- The system does **not** persist counters across restarts. Ring buffers are in-memory only. On aggregator restart, historical state
@@ -98,50 +99,92 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
working set.
- **FR-1.5** `http_response` MUST be the HTTP status code as recorded by nginx.
- **FR-1.6** `is_tor` MUST be a boolean, populated by the operator in the log format (typically via a lookup against a TOR exit-node
list). For the file format, lines without this field default to `false` for backward compatibility.
- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). For the file format, lines without
this field default to `0`.
- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. For records from
on-disk log files, the collector MUST assign the tag `"direct"` (mirroring `nginx-ipng-stats-plugin`'s default-source convention). For
records from the UDP stream, the tag is taken from the log line as emitted by the plugin.
list). Operators without TOR data MUST emit literal `0`.
- **FR-1.7** `asn` MUST be an int32 decimal value sourced from MaxMind GeoIP2 (or equivalent). Operators without GeoIP data MUST
emit literal `0`.
- **FR-1.8** `ipng_source_tag` MUST be a short string identifying which attribution tag the request arrived under. The tag is
always taken verbatim from the log line; the collector does NOT synthesise a fallback. Operators not running
`nginx-ipng-stats-plugin` MUST emit a literal value (typically `"direct"`).
**FR-2 Log formats**
- **FR-2.1 File format.** The collector MUST accept nginx access logs in the following tab-separated layout, with the last two fields
(`is_tor`, `asn`) optional for backward compatibility:
- **FR-2.1 Versioned dispatch.** Both the file tailer and the UDP listener MUST funnel every input line through a single
parser that switches on a leading `v<N>\t` version tag. Lines without a recognised tag, including the legacy
positional file format, MUST be rejected as malformed (FR-2.5). Two versions are defined: `v1` (FR-2.2) and
`v2` (FR-2.3). Both ingest paths accept both versions, and downstream processing is identical regardless of which path
the line came in over. `$server_addr` and `$scheme` are parsed but discarded; they are reserved for future use.
- **FR-2.2 v1 format.** The v1 payload MUST be exactly 12 tab-separated fields after the `v1` tag (13 fields total).
```nginx
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```
| # | Field | Ingested into |
|---|-------------------|----------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated)|
| 2 | `$msec` | (discarded) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |
| # | Field | Ingested into |
|---|-------------------|-------------------------------------|
| 0 | `v1` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom `nginx_http_bytes_sent` |
| 7 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
| 8 | `$is_tor` | `is_tor` |
| 9 | `$asn` | `asn` |
| 10| `$ipng_source_tag`| `source_tag` |
| 11| `$server_addr` | *(parsed and discarded)* |
| 12| `$scheme` | *(parsed and discarded)* |
- **FR-2.2 UDP format.** The collector MUST accept datagrams in a versioned tab-separated layout, as emitted by
`nginx-ipng-stats-plugin`'s `ipng_stats_logtail` directive. Every datagram MUST begin with a literal version tag
(`v<N>\t`) so the collector can route each packet to the appropriate parser. Only `v1` is defined in this revision;
unknown versions MUST be counted as parse failures and dropped.
- **FR-2.3 v2 format.** The v2 payload MUST be exactly 15 tab-separated fields after the `v2` tag (16 fields total).
v2 replaces `$body_bytes_sent` with `$bytes_sent` (full wire bytes including headers) and adds three operationally
important fields: `$request_length` (request size including headers), `$upstream_response_time`, and
`$upstream_status`. The remaining v1 fields keep their relative order.
```nginx
log_format ipng_stats_logtail 'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
log_format ipng_stats_logtail
'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
'$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```
The v1 payload MUST have exactly 12 tab-separated fields after the `v1` tag (13 fields total). `$server_addr` and
`$scheme` MUST be parsed but dropped; they are reserved for future use. Malformed datagrams (wrong version, wrong
field count, bad IP) MUST be counted (FR-8.5) and silently dropped.
| # | Field | Ingested into |
|---|---------------------------|----------------------------------------------|
| 0 | `v2` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$bytes_sent` | Prom `nginx_http_bytes_sent` |
| 7 | `$request_length` | Prom `nginx_http_request_bytes` (v2-only) |
| 8 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
| 9 | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` (v2-only) |
| 10| `$upstream_status` | Prom `nginx_http_upstream_requests_total` (v2-only) |
| 11| `$is_tor` | `is_tor` |
| 12| `$asn` | `asn` |
| 13| `$ipng_source_tag` | `source_tag` |
| 14| `$server_addr` | *(parsed and discarded)* |
| 15| `$scheme` | *(parsed and discarded)* |
- **FR-2.3** The file tailer MUST set `source_tag="direct"` on every record it parses. The UDP listener MUST propagate
`$ipng_source_tag` verbatim. This is the only difference in downstream processing between the two ingest paths.
When nginx serves the response without an upstream (static files, redirects, errors), nginx emits literal `-` for
`$upstream_response_time` and `$upstream_status`. The parser MUST treat that as "no upstream", skip the upstream
histograms, and not increment the upstream counter. When nginx retries across multiple upstreams, both fields are
comma-separated; the parser MUST keep the last entry, since that is the upstream that ultimately served the response.
- **FR-2.4 Semantic shift on v2 rollout.** v1 fills `nginx_http_bytes_sent` from `$body_bytes_sent`; v2 fills it from
`$bytes_sent`. Operators MUST expect a small step up in the metric when emitters move from v1 to v2 (header overhead;
typically a few hundred bytes per response).
- **FR-2.5 Malformed input.** Lines with an unknown version, the wrong field count for the claimed version, or an
unparsable IP MUST be silently dropped. UDP drops MUST be counted via FR-8.6; file-path drops are implicit (the tailer
skips the line and moves on).
- **FR-2.6 Unknown `$is_tor` / `$asn`.** Operators without TOR or GeoIP data MUST emit literal `0` for both fields. A
literal `0` in `$is_tor` parses as `false`; a literal `0` in `$asn` parses as ASN `0`, filterable at query time with
`--asn '!=0'`.
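A minimal Go sketch of the FR-2.1 dispatch, assuming a simplified record type. The names `parseLine`, `record`, and `errMalformed` are hypothetical; this is not the collector's `ParseLine`. It checks the version tag and the field count before reading any field, picks the bytes column per FR-2.4, rejects an unparsable IP per FR-2.5, and strips the query string from the URI.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"strings"
)

// record is a hypothetical, trimmed-down stand-in for the collector's LogRecord.
type record struct {
	Host      string
	Client    netip.Addr
	Method    string
	URI       string
	Status    string // kept as text in this sketch
	BytesSent string // v1: $body_bytes_sent, v2: $bytes_sent (FR-2.4)
	SourceTag string
}

var errMalformed = errors.New("malformed log line")

// parseLine switches on the leading version tag and verifies the field count
// for that version before touching any other field.
func parseLine(line string) (record, error) {
	fields := strings.Split(line, "\t")
	var bytesIdx, tagIdx int
	switch {
	case fields[0] == "v1" && len(fields) == 13:
		bytesIdx, tagIdx = 6, 10
	case fields[0] == "v2" && len(fields) == 16:
		bytesIdx, tagIdx = 6, 13
	default:
		// Unknown version, or wrong field count for the claimed version.
		return record{}, errMalformed
	}
	addr, err := netip.ParseAddr(fields[2])
	if err != nil {
		return record{}, errMalformed // unparsable IP
	}
	uri, _, _ := strings.Cut(fields[4], "?") // drop the query string
	return record{
		Host: fields[1], Client: addr, Method: fields[3], URI: uri,
		Status: fields[5], BytesSent: fields[bytesIdx], SourceTag: fields[tagIdx],
	}, nil
}

func main() {
	v2 := "v2\texample.com\t192.0.2.1\tGET\t/a?b=1\t200\t1234\t56\t0.010\t0.008\t200\t0\t0\tdirect\t10.0.0.1\thttps"
	r, err := parseLine(v2)
	fmt.Println(r.URI, r.BytesSent, r.SourceTag, err)

	// A legacy positional line has no version tag and is rejected.
	_, err = parseLine("example.com\t192.0.2.1\tGET\t/\t200")
	fmt.Println(err)
}
```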
**FR-3 Ring buffers and time windows**
@@ -242,15 +285,21 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
- **FR-8.2** The collector MUST expose a per-request counter `nginx_http_requests_total{host, method, status}` capped at
`promCounterCap = 250 000` distinct label sets. When the cap is reached, further new label sets MUST be dropped (existing series
keep incrementing) until the map is rolled over.
- **FR-8.3** The collector MUST expose per-host histograms
`nginx_http_response_body_bytes{host, le}` (body-size distribution) and
`nginx_http_request_duration_seconds{host, le}` (request-time distribution). The duration histogram MUST NOT be split by
`source_tag` — its bucket count would multiply without operational benefit.
- **FR-8.4** The collector MUST expose two parallel roll-ups labeled by `source_tag` only (not cross-producted with host):
`nginx_http_requests_by_source_total{source_tag}` and
`nginx_http_response_body_bytes_by_source{source_tag, le}`. These are separate metric names to avoid inconsistent label sets
under a single name.
- **FR-8.5** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
- **FR-8.3** The collector MUST expose two histograms keyed by `{host, source_tag}`:
`nginx_http_bytes_sent{host, source_tag, le}` (response wire-bytes distribution; FR-2.4) and
`nginx_http_request_duration_seconds{host, source_tag, le}` (end-to-end request time distribution).
Cardinality is bounded by `host × source_tag × bucket_count`, which is small enough that no explicit cap is required.
- **FR-8.4** The collector MUST expose three v2-only metrics that are populated only when v2 records arrive (and, for the
upstream metrics, only when nginx involved an upstream):
`nginx_http_request_bytes{host, source_tag, le}` from `$request_length`,
`nginx_http_upstream_duration_seconds{host, source_tag, le}` from `$upstream_response_time`, and
`nginx_http_upstream_requests_total{host, source_tag, status_class}` from `$upstream_status`. `status_class` is the
HTTP class of the upstream's status code, folded to `2xx`/`3xx`/`4xx`/`5xx`/`other`.
- **FR-8.5** The collector MUST expose a source-tag rollup counter
`nginx_http_requests_by_source_total{source_tag, status_class}`. `status_class` is the HTTP class of `$status`, folded
the same way as in FR-8.4. This rollup is intentionally not cross-producted with `host` — its purpose is fleet-wide
source-attribution health, not per-host detail.
- **FR-8.6** The collector MUST expose three counters that let operators distinguish UDP parse failures from back-pressure drops:
`logtail_udp_packets_received_total` (datagrams off the socket, one increment per `recvfrom`),
`logtail_udp_loglines_success_total` (log lines that parsed OK, incremented once per log line — a single batched datagram from
the nginx plugin may contribute many), and
@@ -279,13 +328,13 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
for a collector is therefore approximately 845 MB (live map ~19 MB + fine ring ~558 MB + coarse ring ~268 MB).
- **NFR-2.3** The aggregator MUST apply the same tier caps as the collector. Its steady-state memory is roughly equivalent to one
collector regardless of the number of collectors subscribed.
- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The per-host and per-source
histograms MUST NOT be capped explicitly — they grow only with the distinct host count, which is bounded by the operator's vhost
configuration.
- **NFR-2.4** The Prometheus counter map (FR-8.2) MUST be capped at `promCounterCap = 250 000` entries. The dual-labeled
`{host, source_tag}` histograms MUST NOT be capped explicitly — they grow only with the cross-product of distinct
hosts and distinct source tags, both bounded by the operator's nginx configuration.
**NFR-3 Performance**
- **NFR-3.1** `ParseLine` and `ParseUDPLine` MUST use `strings.Split` / `strings.SplitN` (no regex), so that per-line cost stays
- **NFR-3.1** `ParseLine` MUST use `strings.Split` / `strings.IndexByte` (no regex), so that per-line cost stays
around 50 ns on commodity hardware.
- **NFR-3.2** `TopN` and `Trend` queries across the full 24-hour coarse ring MUST complete in well under 250 ms at the 50 000-entry
fine cap, for fully-specified filters.
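NFR-3.1 names `strings.IndexByte` alongside `strings.Split`. One way to use it, sketched here under the assumption that the version tag and field count are checked before any split, is to vet a line without allocating. The helpers `versionTag` and `fieldCount` are hypothetical and say nothing about how `ParseLine` is actually structured; no timing is implied.

```go
package main

import (
	"fmt"
	"strings"
)

// versionTag peeks at the leading "v<N>" tag with a single IndexByte scan and
// no allocation. ok is false if there is no tab or the tag is not "v" + digits.
func versionTag(line string) (string, bool) {
	i := strings.IndexByte(line, '\t')
	if i < 2 || line[0] != 'v' {
		return "", false
	}
	for j := 1; j < i; j++ {
		if line[j] < '0' || line[j] > '9' {
			return "", false
		}
	}
	return line[:i], true
}

// fieldCount counts tab-separated fields without building a []string, so a
// line with the wrong count can be rejected before strings.Split allocates.
func fieldCount(line string) int {
	return strings.Count(line, "\t") + 1
}

func main() {
	line := "v1\texample.com\t192.0.2.1\tGET\t/\t200\t512\t0.010\t0\t0\tdirect\t10.0.0.1\thttps"
	tag, ok := versionTag(line)
	fmt.Println(tag, ok, fieldCount(line))
}
```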
@@ -417,8 +466,10 @@ connected.
#### Key data types
- `LogRecord` — ten fields (website, client_prefix, URI, status, is_tor, asn, method, body_bytes_sent, request_time, source_tag).
Produced by `ParseLine` or `ParseUDPLine` and consumed by the store goroutine.
- `LogRecord` — fourteen fields (website, client_prefix, URI, status, is_tor, asn, method, bytes_sent, request_length,
request_time, upstream_response_time, upstream_status, has_upstream, source_tag). Produced by `ParseLine` (which
dispatches on the `v<N>\t` prefix) and consumed by the store goroutine. v1 records leave the v2-only fields
(`request_length`, upstream_*) at zero / false.
- `Tuple6` (historical name; carries seven fields now) — the aggregation key. NUL-separated when encoded as a map key for snapshots.
The code name is intentionally stable so downstream tests and consumers are not churned.
- `Snapshot` — `(timestamp, []Entry)` where `Entry = (label, count)` and `label` is an encoded `Tuple6`.
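The fourteen `LogRecord` fields can be pictured as the Go struct below. Only the field names come from the list above; the Go types are assumptions, except `ASN`, which FR-1.7 defines as int32. It illustrates that a v1 record simply never sets the v2-only fields, which therefore stay at their zero values.

```go
package main

import "fmt"

// LogRecord is a sketch of the fourteen fields listed above; the types are
// assumed, not taken from the collector's source.
type LogRecord struct {
	Website              string
	ClientPrefix         string
	URI                  string
	Status               int
	IsTor                bool
	ASN                  int32 // FR-1.7
	Method               string
	BytesSent            int64
	RequestLength        int64   // v2-only
	RequestTime          float64 // seconds
	UpstreamResponseTime float64 // v2-only
	UpstreamStatus       int     // v2-only
	HasUpstream          bool    // v2-only; false when nginx logged "-"
	SourceTag            string
}

func main() {
	// A v1 record leaves RequestLength, UpstreamStatus, HasUpstream, etc. unset.
	v1 := LogRecord{Website: "example.com", Status: 200, Method: "GET",
		BytesSent: 512, RequestTime: 0.01, SourceTag: "direct"}
	fmt.Println(v1.RequestLength, v1.UpstreamStatus, v1.HasUpstream)
}
```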
@@ -559,9 +610,10 @@ transitions. No per-request logging.
- **Frontend crash.** Stateless. Operator restarts.
- **UDP datagram loss.** Any datagram dropped in-kernel (socket buffer full, network drop) does not register as a parse failure; it
is simply invisible. Operators should size `SO_RCVBUF` appropriately; the collector already requests 4 MiB.
- **Malformed log lines.** File format: lines with <8 tab-separated fields are silently skipped; an invalid IP also drops the line.
UDP: packets without a recognised `v<N>\t` prefix, or with the wrong field count for the claimed version, or with a bad IP, are
counted as received-but-not-success and dropped.
- **Malformed log lines.** Both ingest paths use the versioned `v<N>\t` parser (FR-2). Lines without a recognised version
prefix, with the wrong field count for the claimed version, or with a bad IP are silently dropped. UDP drops are
counted by the FR-8.6 counters (for unbatched datagrams, `packets_received_total - loglines_success_total`); file-path
drops are implicit (the tailer simply moves past them).
- **Clock skew between collectors.** Trend sparklines derived from merged data assume collectors are roughly NTP-synced. Per-bucket
alignment is to the local minute / 5-minute boundary of each collector.
- **gRPC traffic over untrusted links.** The system does not ship TLS; operators should front the gRPC ports with a TLS-terminating
@@ -588,11 +640,11 @@ transitions. No per-request logging.
- **pull-based collector polling (aggregator polls collectors every second).** Rejected in favor of push. Polling multiplies query
latency and makes the aggregator's cache stale by the poll interval. Push-stream with delta merge keeps the cache within seconds
of real time.
- **One metric name for both per-host and per-source_tag roll-ups.** Rejected for Prometheus hygiene. Mixing different label sets
under one metric name breaks aggregation rules; separate metric names (`_by_source`) are clearer and easier to query.
- **Cross-product of `host × source_tag` for every counter and histogram.** Rejected. With ~20 tags and ~50 hosts the cardinality
explodes quickly on the duration histogram without operational benefit. The duration histogram stays per-host; requests and body
size get a parallel `_by_source` rollup.
- **Separate `_by_source` metric names with a single label.** The original v0.2 layout exposed `_by_source` siblings to
avoid mixing label sets under one metric name. Superseded by the v0.3 layout: histograms now carry both `host` and
`source_tag` directly, and the source-tag rollup counter gains a `status_class` label. Cardinality stays bounded
(~7 hosts × ~6 tags × 11 buckets ≈ 460 series per histogram), and Grafana queries become simpler (`sum by(source_tag)`
rather than picking a different metric name).
- **Writing every `snapshot` to disk for restart recovery.** Rejected in favor of `DumpSnapshots` RPC backfill. Disk-backed
persistence would multiply operational surface (rotation, fsck, permissions) for a feature that needs to survive only an
aggregator restart.
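The ≈460 figure in the v0.3 entry corresponds to 7 × 6 × 11 = 462 finite duration buckets. A Prometheus histogram also exposes an `le="+Inf"` bucket plus `_sum` and `_count` for every label set, so the full per-metric series count is somewhat higher. The sketch below does that arithmetic for the same 7-host / 6-tag example; `histogramSeries` is a hypothetical helper.

```go
package main

import "fmt"

// histogramSeries estimates the series one {host, source_tag} histogram
// exposes: one _bucket series per finite bound, one for le="+Inf", plus
// _sum and _count, for every host/tag pair.
func histogramSeries(hosts, tags, finiteBuckets int) int {
	return hosts * tags * (finiteBuckets + 1 + 2)
}

func main() {
	fmt.Println(7 * 6 * 11)                // finite duration buckets only: 462
	fmt.Println(histogramSeries(7, 6, 11)) // duration histogram, all series
	fmt.Println(histogramSeries(7, 6, 7))  // byte histograms (7 finite bounds)
}
```

Either way the total grows linearly in hosts and tags, which is why NFR-2.4 leaves these histograms uncapped.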
+114 -92
@@ -131,101 +131,98 @@ or for temporary overrides, without editing the unit.
The file is **not a dpkg conffile**: postinst writes it only when absent, so operator edits
survive upgrades, and `dpkg --purge` removes it.
### nginx — file-based ingest
### nginx — log format
Add the `logtail` format and attach it to whichever `server` blocks you want tracked:
Both ingest paths (file and UDP) use the same versioned tab-separated format. Every line MUST
begin with a literal `v1\t` or `v2\t` prefix; lines without a recognised prefix are dropped.
Two versions are defined; you can mix them across a fleet during a rollout (the collector
parses both).
```nginx
http {
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
#### v2 (recommended)
server {
access_log /var/log/nginx/access.log logtail;
# or per-vhost:
access_log /var/log/nginx/www.example.com.access.log logtail;
}
}
```
Tab-separated, fixed field order, ten fields. The precise layout:
| # | Field | Ingested into |
|---|-------------------|--------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated) |
| 2 | `$msec` | *(discarded)* |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |
`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`).
**If either is unknown, emit `0`.** A literal `0` in `$is_tor` parses as `false`; a literal
`0` in `$asn` parses as ASN `0`, which you can exclude at query time with `--asn '!=0'` / the
`asn!=0` filter expression. Operators who don't have TOR or GeoIP data can simply emit `0` for
both columns and everything works.
Both fields are also **positionally optional** for backward compatibility — older 8-field
lines are accepted and default to `false` / `0`. Records from the file tailer are always
tagged `source_tag="direct"`.
Then point the collector at the log files via `COLLECTOR_LOGS` — comma-separated paths or
glob patterns. Make sure the files are group-readable by `www-data` (the collector's primary
group in the systemd unit).
### nginx — UDP ingest (`nginx-ipng-stats-plugin`)
If the nginx host runs [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin),
the plugin's `ipng_stats_logtail` directive emits one UDP datagram per request directly to
the collector, no log file involved. The wire format is **versioned** — every datagram starts
with a literal `v1\t` prefix so the collector can ship new parser versions (v2, v3, …) before
emitters are upgraded and route each packet accordingly.
v2 carries four operationally important fields v1 lacks: `$bytes_sent` (full wire bytes,
replaces `$body_bytes_sent`), `$request_length` (request size including headers),
`$upstream_response_time`, and `$upstream_status`. Together they let dashboards split
end-to-end latency into upstream vs. nginx overhead, attribute errors to the upstream vs. the
edge, and report ingress bandwidth.
```nginx
http {
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
'$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
# File ingest:
server {
access_log /var/log/nginx/access.log ipng_stats_logtail;
}
# UDP ingest (nginx-ipng-stats-plugin):
ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s;
}
```
Precise v1 layout — 13 tab-separated fields total (version prefix + 12 payload fields):
| # | Field | Ingested into |
|---|---------------------------|---------------------------------------------------|
| 0 | `v2` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$bytes_sent` | Prom `nginx_http_bytes_sent` |
| 7 | `$request_length` | Prom `nginx_http_request_bytes` |
| 8 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
| 9 | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` |
| 10| `$upstream_status` | Prom `nginx_http_upstream_requests_total` |
| 11| `$is_tor` | `is_tor` |
| 12| `$asn` | `asn` |
| 13| `$ipng_source_tag` | `source_tag` |
| 14| `$server_addr` | *(parsed and discarded)* |
| 15| `$scheme` | *(parsed and discarded)* |
| # | Field | Ingested into |
|---|-------------------|------------------------------|
| 0 | `v1` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` |
| 9 | `$asn` | `asn` |
| 10| `$ipng_source_tag`| `source_tag` |
| 11| `$server_addr` | *(parsed and discarded)* |
| 12| `$scheme` | *(parsed and discarded)* |
For requests served without an upstream (static files, redirects, errors), nginx emits
literal `-` for `$upstream_response_time` and `$upstream_status`; the parser treats those as
"no upstream" and skips the upstream metrics rather than counting them as zeros. When nginx
retries across multiple upstreams, both fields are comma-separated and the parser keeps the
last value (the upstream that ultimately served the response).
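The "no upstream" and "keep the last entry" rules can be sketched in Go as follows. The helper names are hypothetical. Besides the comma-separated retries described above, nginx's upstream module documents a `" : "` separator between server groups after an internal redirect (`X-Accel-Redirect`, `error_page`); this sketch treats both as separators, which is an assumption about what the collector does.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lastUpstream returns the final entry of an nginx $upstream_* value, i.e.
// the upstream that produced the response.
func lastUpstream(v string) string {
	if i := strings.LastIndexAny(v, ",:"); i >= 0 {
		v = v[i+1:]
	}
	return strings.TrimSpace(v)
}

// parseUpstream returns ok=false when nginx logged "-" (no upstream), so the
// caller skips the upstream histogram and counter instead of recording zero.
func parseUpstream(respTime, status string) (float64, int, bool) {
	t, s := lastUpstream(respTime), lastUpstream(status)
	if t == "-" || s == "-" {
		return 0, 0, false
	}
	sec, err1 := strconv.ParseFloat(t, 64)
	code, err2 := strconv.Atoi(s)
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return sec, code, true
}

func main() {
	fmt.Println(parseUpstream("0.012, 0.034", "502, 200")) // retried once
	fmt.Println(parseUpstream("-", "-"))                   // static file
}
```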
Compared to the file format: the version tag is added, `$msec` is dropped, and three fields
are appended — `$ipng_source_tag` (propagated into the data model), `$server_addr` and
`$scheme` (reserved for future use).
#### v1 (legacy)
**Unknown `$is_tor` / `$asn`: emit `0`.** Same convention as the file format — operators
without TOR or GeoIP data can emit `0` for both columns and everything works. A literal `0`
in `$is_tor` is `false`; a literal `0` in `$asn` is ASN `0`, filterable at query time.
v1 is preserved unchanged so existing emitters can be upgraded after the collector. Layout:
All 13 fields are required for v1 — malformed packets (wrong version, wrong field count, bad
IP) are silently dropped and counted via `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`. Both paths (file + UDP) can feed the same collector
simultaneously; they converge on the same aggregation pipeline.
```nginx
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```
12 tab-separated payload fields after the `v1` prefix. v1 fills `nginx_http_bytes_sent` from
`$body_bytes_sent`; v2 fills it from `$bytes_sent`. Operators will see a small step up in
that metric (header overhead, typically a few hundred bytes per response) when emitters move
to v2.
#### Required values
`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`). Operators without TOR or
GeoIP data MUST emit literal `0` for both — a literal `0` in `$is_tor` parses as `false`; a
literal `0` in `$asn` is ASN `0`, filterable at query time with `--asn '!=0'`.
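A sketch of parsing these two columns, assuming the parser accepts only `0`/`1` for `$is_tor` and a decimal int32 for `$asn` (FR-1.7). Rejecting any other `$is_tor` value is an assumption of this sketch, and `parseIsTor` / `parseASN` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseIsTor accepts "1" (TOR exit node) and "0" (not TOR, or no TOR data).
func parseIsTor(s string) (bool, error) {
	switch s {
	case "1":
		return true, nil
	case "0":
		return false, nil
	}
	return false, fmt.Errorf("is_tor: unexpected value %q", s)
}

// parseASN reads the decimal AS number into an int32; "0" means "no GeoIP
// data" and can be excluded at query time with --asn '!=0'.
func parseASN(s string) (int32, error) {
	n, err := strconv.ParseInt(s, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("asn: %w", err)
	}
	return int32(n), nil
}

func main() {
	tor, _ := parseIsTor("0")
	asn, _ := parseASN("64500")
	fmt.Println(tor, asn)
}
```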
`$ipng_source_tag` is provided by [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin).
Operators not running the plugin SHOULD declare a constant via `set $ipng_source_tag direct;`
in their `server` block — there is no synthesised fallback in the collector.
#### Pointing the collector at logs
For file ingest, set `COLLECTOR_LOGS` to comma-separated paths or glob patterns. Make sure
the files are group-readable by `www-data` (the collector's primary group in the systemd
unit). For UDP ingest, the plugin's `ipng_stats_logtail udp://127.0.0.1:9514` line above is
sufficient. Both paths can feed the same collector simultaneously and converge on the same
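aggregation pipeline. Malformed lines (wrong version, wrong field count, bad IP) are silently
dropped; for UDP they show up in the UDP ingest counters. When each datagram carries a single
line, the drop count is `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`; batched datagrams contribute several lines per packet, so
that subtraction no longer holds.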
---
@@ -303,21 +300,33 @@ the new file appears. No restart or SIGHUP required.
The collector exposes a Prometheus-compatible `/metrics` endpoint on `--prom-listen` (default
`:9100`). Set `--prom-listen ""` to disable it entirely.
**Per-host series:**
**Per-host and per-`{host, source_tag}` series** (both v1 and v2):
- `nginx_http_requests_total{host, method, status}` — counter. Map capped at 250 000 distinct
label sets; new entries beyond the cap are dropped until the map is rolled over.
- `nginx_http_response_body_bytes_{bucket,count,sum}{host, le}` — histogram of
`$body_bytes_sent`. Buckets (bytes): `256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, le}` — histogram of
`$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5,
10, +Inf`. Not split by `source_tag` (duration histogram stays per-host to avoid cardinality
blow-up).
- `nginx_http_bytes_sent_{bucket,count,sum}{host, source_tag, le}` — histogram of response
size. v1 fills from `$body_bytes_sent`; v2 fills from `$bytes_sent`. Buckets (bytes):
`256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, source_tag, le}` — histogram
of `$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5,
5, 10, +Inf`.
**Per-`source_tag` roll-ups** (parallel series, not a cross-product with `host`):
**v2-only series** (populated only when v2 emitters are running, and the upstream histograms
only when nginx involved an upstream):
- `nginx_http_requests_by_source_total{source_tag}`, a counter.
- `nginx_http_response_body_bytes_by_source_{bucket,count,sum}{source_tag, le}` — histogram.
- `nginx_http_request_bytes_{bucket,count,sum}{host, source_tag, le}`, a histogram of
`$request_length` (ingress, headers + body). Same byte buckets as `bytes_sent`.
- `nginx_http_upstream_duration_seconds_{bucket,count,sum}{host, source_tag, le}`
histogram of `$upstream_response_time`. Same time buckets as `request_duration`. Lets you
split end-to-end latency into upstream vs. nginx overhead.
- `nginx_http_upstream_requests_total{host, source_tag, status_class}` — counter incremented
once per upstream-served request, classed by `$upstream_status` (`2xx`/`3xx`/`4xx`/`5xx`/`other`).
Lets you spot upstream errors masked at the edge (e.g. nginx 502 because origin 504).
**Source-tag rollup** (fleet-wide attribution health, intentionally not crossed with host):
- `nginx_http_requests_by_source_total{source_tag, status_class}` — counter classed by
`$status`. Use it to spot per-source error spikes without exploding cardinality.
**UDP ingest counters** — lets operators distinguish parse failures from back-pressure drops:
@@ -365,10 +374,23 @@ histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)
# Median response body size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_response_body_bytes_bucket[5m]))
# 95th percentile response time per source_tag (drill in further as needed)
histogram_quantile(0.95,
sum by (source_tag, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)
# Median response size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_bytes_sent_bucket[5m]))
)
# v2-only: upstream P95, split out from nginx overhead
histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_upstream_duration_seconds_bucket[5m]))
)
# v2-only: upstream 5xx rate per source_tag
sum by (source_tag) (rate(nginx_http_upstream_requests_total{status_class="5xx"}[5m]))
```
### Memory usage