RELEASE 1.0.1: v2 log format, source_tag-labeled metrics, lint cleanup

Wire-format and metric overhaul. Both file and UDP ingest now share one
versioned ParseLine that dispatches on the v<N>\t prefix; v1 stays
unchanged, v2 adds $bytes_sent (replacing $body_bytes_sent),
$request_length, $upstream_response_time, and $upstream_status. File
ingest gains the same versioning, and the legacy positional file format
is removed (no live deployments).

Prometheus exposition is rewritten:

  - nginx_http_bytes_sent and nginx_http_request_duration_seconds gain
    a source_tag label.
  - nginx_http_requests_by_source_total gains status_class.
  - New v2-only metrics: nginx_http_request_bytes,
    nginx_http_upstream_duration_seconds,
    nginx_http_upstream_requests_total{status_class}.
  - Dropped nginx_http_response_body_bytes_by_source (subsumed by the
    dual-labeled bytes_sent metric).

Adds 'make fixstyle' (gofmt -w) and clears all golangci-lint findings
across the repo (errcheck, S1001, ST1005, unused).

Docs in design.md FR-2/FR-8 and user-guide.md are rewritten to present
v2 as the recommended log format.
This commit is contained in:
2026-05-01 15:40:53 +02:00
parent d1a21a7a62
commit 6647f95be4
28 changed files with 931 additions and 724 deletions
+114 -92
View File
@@ -131,101 +131,98 @@ or for temporary overrides, without editing the unit.
The file is **not a dpkg conffile**: postinst writes it only when absent, so operator edits
survive upgrades, and `dpkg --purge` removes it.
### nginx — file-based ingest
### nginx — log format
Add the `logtail` format and attach it to whichever `server` blocks you want tracked:
Both ingest paths (file and UDP) use the same versioned tab-separated format. Every line MUST
begin with a literal `v1\t` or `v2\t` prefix; lines without a recognised prefix are dropped.
Two versions are defined; you can mix them across a fleet during a rollout (the collector
parses both).
```nginx
http {
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
#### v2 (recommended)
server {
access_log /var/log/nginx/access.log logtail;
# or per-vhost:
access_log /var/log/nginx/www.example.com.access.log logtail;
}
}
```
Tab-separated, fixed field order, ten fields. The precise layout:
| # | Field | Ingested into |
|---|-------------------|--------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated) |
| 2 | `$msec` | *(discarded)* |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |
`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`).
**If either is unknown, emit `0`.** A literal `0` in `$is_tor` parses as `false`; a literal
`0` in `$asn` parses as ASN `0`, which you can exclude at query time with `--asn '!=0'` / the
`asn!=0` filter expression. Operators who don't have TOR or GeoIP data can simply emit `0` for
both columns and everything works.
Both fields are also **positionally optional** for backward compatibility — older 8-field
lines are accepted and default to `false` / `0`. Records from the file tailer are always
tagged `source_tag="direct"`.
Then point the collector at the log files via `COLLECTOR_LOGS` — comma-separated paths or
glob patterns. Make sure the files are group-readable by `www-data` (the collector's primary
group in the systemd unit).
### nginx — UDP ingest (`nginx-ipng-stats-plugin`)
If the nginx host runs [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin),
the plugin's `ipng_stats_logtail` directive emits one UDP datagram per request directly to
the collector, no log file involved. The wire format is **versioned** — every datagram starts
with a literal `v1\t` prefix so the collector can ship new parser versions (v2, v3, …) before
emitters are upgraded and route each packet accordingly.
v2 carries five operationally important fields v1 lacks: `$bytes_sent` (full wire bytes,
replaces `$body_bytes_sent`), `$request_length` (request size including headers),
`$upstream_response_time`, and `$upstream_status`. Together they let dashboards split
end-to-end latency into upstream vs. nginx overhead, attribute errors to the upstream vs. the
edge, and report ingress bandwidth.
```nginx
http {
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
'$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
# File ingest:
server {
access_log /var/log/nginx/access.log ipng_stats_logtail;
}
# UDP ingest (nginx-ipng-stats-plugin):
ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s;
}
```
Precise v1 layout — 13 tab-separated fields total (version prefix + 12 payload fields):
| # | Field | Ingested into |
|---|---------------------------|---------------------------------------------------|
| 0 | `v2` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$bytes_sent` | Prom `nginx_http_bytes_sent` |
| 7 | `$request_length` | Prom `nginx_http_request_bytes` |
| 8 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
| 9 | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` |
| 10| `$upstream_status` | Prom `nginx_http_upstream_requests_total` |
| 11| `$is_tor` | `is_tor` |
| 12| `$asn` | `asn` |
| 13| `$ipng_source_tag` | `source_tag` |
| 14| `$server_addr` | *(parsed and discarded)* |
| 15| `$scheme` | *(parsed and discarded)* |
| # | Field | Ingested into |
|---|-------------------|------------------------------|
| 0 | `v1` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` |
| 9 | `$asn` | `asn` |
| 10| `$ipng_source_tag`| `source_tag` |
| 11| `$server_addr` | *(parsed and discarded)* |
| 12| `$scheme` | *(parsed and discarded)* |
For requests served without an upstream (static files, redirects, errors), nginx emits
literal `-` for `$upstream_response_time` and `$upstream_status`; the parser treats those as
"no upstream" and skips the upstream metrics rather than counting them as zeros. When nginx
retries across multiple upstreams, both fields are comma-separated and the parser keeps the
last value (the upstream that ultimately served the response).
Compared to the file format: the version tag is added, `$msec` is dropped, and three fields
are appended — `$ipng_source_tag` (propagated into the data model), `$server_addr` and
`$scheme` (reserved for future use).
#### v1 (legacy)
**Unknown `$is_tor` / `$asn`: emit `0`.** Same convention as the file format — operators
without TOR or GeoIP data can emit `0` for both columns and everything works. A literal `0`
in `$is_tor` is `false`; a literal `0` in `$asn` is ASN `0`, filterable at query time.
v1 is preserved unchanged so existing emitters can be upgraded after the collector. Layout:
All 13 fields are required for v1 — malformed packets (wrong version, wrong field count, bad
IP) are silently dropped and counted via `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`. Both paths (file + UDP) can feed the same collector
simultaneously; they converge on the same aggregation pipeline.
```nginx
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```
12 tab-separated payload fields after the `v1` prefix. v1 fills `nginx_http_bytes_sent` from
`$body_bytes_sent`; v2 fills it from `$bytes_sent`. Operators will see a small step up in
that metric (header overhead, typically a few hundred bytes per response) when emitters move
to v2.
#### Required values
`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`). Operators without TOR or
GeoIP data MUST emit literal `0` for both — a literal `0` in `$is_tor` parses as `false`; a
literal `0` in `$asn` is ASN `0`, filterable at query time with `--asn '!=0'`.
`$ipng_source_tag` is provided by [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin).
Operators not running the plugin SHOULD declare a constant via `set $ipng_source_tag direct;`
in their `server` block — there is no synthesised fallback in the collector.
#### Pointing the collector at logs
For file ingest, set `COLLECTOR_LOGS` to comma-separated paths or glob patterns. Make sure
the files are group-readable by `www-data` (the collector's primary group in the systemd
unit). For UDP ingest, the plugin's `ipng_stats_logtail udp://127.0.0.1:9514` line above is
sufficient. Both paths can feed the same collector simultaneously and converge on the same
aggregation pipeline. Malformed lines (wrong version, wrong field count, bad IP) are silently
dropped; for UDP they show up as `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`.
---
@@ -303,21 +300,33 @@ the new file appears. No restart or SIGHUP required.
The collector exposes a Prometheus-compatible `/metrics` endpoint on `--prom-listen` (default
`:9100`). Set `--prom-listen ""` to disable it entirely.
**Per-host series:**
**Per-{host,source_tag} series** (both v1 and v2):
- `nginx_http_requests_total{host, method, status}` — counter. Map capped at 250 000 distinct
label sets; new entries beyond the cap are dropped until the map is rolled over.
- `nginx_http_response_body_bytes_{bucket,count,sum}{host, le}` — histogram of
`$body_bytes_sent`. Buckets (bytes): `256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, le}` — histogram of
`$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5,
10, +Inf`. Not split by `source_tag` (duration histogram stays per-host to avoid cardinality
blow-up).
- `nginx_http_bytes_sent_{bucket,count,sum}{host, source_tag, le}` — histogram of response
size. v1 fills from `$body_bytes_sent`; v2 fills from `$bytes_sent`. Buckets (bytes):
`256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, source_tag, le}` — histogram
of `$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5,
5, 10, +Inf`.
**Per-`source_tag` roll-ups** (parallel series, not a cross-product with `host`):
**v2-only series** (populated only when v2 emitters are running, and the upstream histograms
only when nginx involved an upstream):
- `nginx_http_requests_by_source_total{source_tag}`counter.
- `nginx_http_response_body_bytes_by_source_{bucket,count,sum}{source_tag, le}` — histogram.
- `nginx_http_request_bytes_{bucket,count,sum}{host, source_tag, le}`histogram of
`$request_length` (ingress, headers + body). Same byte buckets as `bytes_sent`.
- `nginx_http_upstream_duration_seconds_{bucket,count,sum}{host, source_tag, le}`
histogram of `$upstream_response_time`. Same time buckets as `request_duration`. Lets you
split end-to-end latency into upstream vs. nginx overhead.
- `nginx_http_upstream_requests_total{host, source_tag, status_class}` — counter incremented
once per upstream-served request, classed by `$upstream_status` (`2xx`/`3xx`/`4xx`/`5xx`/`other`).
Lets you spot upstream errors masked at the edge (e.g. nginx 502 because origin 504).
**Source-tag rollup** (fleet-wide attribution health, intentionally not crossed with host):
- `nginx_http_requests_by_source_total{source_tag, status_class}` — counter classed by
`$status`. Use it to spot per-source error spikes without exploding cardinality.
**UDP ingest counters** — lets operators distinguish parse failures from back-pressure drops:
@@ -365,10 +374,23 @@ histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)
# Median response body size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_response_body_bytes_bucket[5m]))
# 95th percentile response time per source_tag (drill in further as needed)
histogram_quantile(0.95,
sum by (source_tag, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)
# Median response size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_bytes_sent_bucket[5m]))
)
# v2-only: upstream P95, split out from nginx overhead
histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_upstream_duration_seconds_bucket[5m]))
)
# v2-only: upstream 5xx rate per source_tag
sum by (source_tag) (rate(nginx_http_upstream_requests_total{status_class="5xx"}[5m]))
```
### Memory usage