RELEASE 1.0.1: v2 log format, source_tag-labeled metrics, lint cleanup

Wire-format and metric overhaul. Both file and UDP ingest now share one
versioned ParseLine that dispatches on the v<N>\t prefix; v1 stays
unchanged, v2 adds $bytes_sent (replacing $body_bytes_sent),
$request_length, $upstream_response_time, and $upstream_status. File
ingest gains the same versioning, and the legacy positional file format
is removed (no live deployments).

Prometheus exposition is rewritten:
- nginx_http_bytes_sent and nginx_http_request_duration_seconds gain
  a source_tag label.
- nginx_http_requests_by_source_total gains status_class.
- New v2-only metrics: nginx_http_request_bytes,
  nginx_http_upstream_duration_seconds,
  nginx_http_upstream_requests_total{status_class}.
- Dropped nginx_http_response_body_bytes_by_source (subsumed by the
  dual-labeled bytes_sent metric).

Adds 'make fixstyle' (gofmt -w) and clears all golangci-lint findings
across the repo (errcheck, S1001, ST1005, unused).

Docs in design.md FR-2/FR-8 and user-guide.md are rewritten to present
v2 as the recommended log format.
+114
-92
@@ -131,101 +131,98 @@ or for temporary overrides, without editing the unit.
The file is **not a dpkg conffile**: postinst writes it only when absent, so operator edits
survive upgrades, and `dpkg --purge` removes it.

### nginx — file-based ingest
### nginx — log format

Add the `logtail` format and attach it to whichever `server` blocks you want tracked:
Both ingest paths (file and UDP) use the same versioned tab-separated format. Every line MUST
begin with a literal `v1\t` or `v2\t` prefix; lines without a recognised prefix are dropped.
Two versions are defined; you can mix them across a fleet during a rollout (the collector
parses both).
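
The prefix dispatch described above can be sketched in Go, the language of this repository. This is a minimal illustration and not the collector's actual `ParseLine`: the helper name `splitVersioned` and its error messages are hypothetical, the payload field counts (12 for v1, 15 for v2) are read off the layouts in this guide, and the sketch assumes v2 is as strict about field count as v1.

```go
package main

import (
	"fmt"
	"strings"
)

// Payload field counts after the version tag, taken from the layouts
// documented in this guide.
const (
	v1PayloadFields = 12
	v2PayloadFields = 15
)

// splitVersioned is a hypothetical helper (not the collector's ParseLine).
// It returns the format version and the payload fields, or an error for
// lines that would be dropped.
func splitVersioned(line string) (int, []string, error) {
	tag, rest, ok := strings.Cut(line, "\t")
	if !ok {
		return 0, nil, fmt.Errorf("no version prefix")
	}
	var version, want int
	switch tag {
	case "v1":
		version, want = 1, v1PayloadFields
	case "v2":
		version, want = 2, v2PayloadFields
	default:
		return 0, nil, fmt.Errorf("unrecognised version tag %q", tag)
	}
	fields := strings.Split(rest, "\t")
	if len(fields) != want {
		// Assumption: v2 enforces an exact field count just as v1 does.
		return 0, nil, fmt.Errorf("v%d: got %d payload fields, want %d", version, len(fields), want)
	}
	return version, fields, nil
}

func main() {
	line := "v1\texample.com\t192.0.2.1\tGET\t/\t200\t512\t0.003\t0\t0\tdirect\t192.0.2.10\thttps"
	version, fields, err := splitVersioned(line)
	fmt.Println(version, len(fields), err)
}
```

Because the version tag is inspected before any field is parsed, a mixed fleet of v1 and v2 emitters can feed the same function during a rollout.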

```nginx
http {
log_format logtail '$host\t$remote_addr\t$msec\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn';
#### v2 (recommended)

server {
access_log /var/log/nginx/access.log logtail;
# or per-vhost:
access_log /var/log/nginx/www.example.com.access.log logtail;
}
}
```

Tab-separated, fixed field order, ten fields. The precise layout:

| # | Field | Ingested into |
|---|-------------------|--------------------------|
| 0 | `$host` | `website` |
| 1 | `$remote_addr` | `client_prefix` (truncated) |
| 2 | `$msec` | *(discarded)* |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` (optional) |
| 9 | `$asn` | `asn` (optional) |

`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`).

**If either is unknown, emit `0`.** A literal `0` in `$is_tor` parses as `false`; a literal
`0` in `$asn` parses as ASN `0`, which you can exclude at query time with `--asn '!=0'` / the
`asn!=0` filter expression. Operators who don't have TOR or GeoIP data can simply emit `0` for
both columns and everything works.

Both fields are also **positionally optional** for backward compatibility — older 8-field
lines are accepted and default to `false` / `0`. Records from the file tailer are always
tagged `source_tag="direct"`.

Then point the collector at the log files via `COLLECTOR_LOGS` — comma-separated paths or
glob patterns. Make sure the files are group-readable by `www-data` (the collector's primary
group in the systemd unit).

### nginx — UDP ingest (`nginx-ipng-stats-plugin`)

If the nginx host runs [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin),
the plugin's `ipng_stats_logtail` directive emits one UDP datagram per request directly to
the collector, no log file involved. The wire format is **versioned** — every datagram starts
with a literal `v1\t` prefix so the collector can ship new parser versions (v2, v3, …) before
emitters are upgraded and route each packet accordingly.
v2 carries four operationally important fields v1 lacks: `$bytes_sent` (full wire bytes,
replaces `$body_bytes_sent`), `$request_length` (request size including headers),
`$upstream_response_time`, and `$upstream_status`. Together they let dashboards split
end-to-end latency into upstream vs. nginx overhead, attribute errors to the upstream vs. the
edge, and report ingress bandwidth.

```nginx
http {
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
'v2\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$bytes_sent\t$request_length\t$request_time\t$upstream_response_time\t$upstream_status\t'
'$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';

# File ingest:
server {
access_log /var/log/nginx/access.log ipng_stats_logtail;
}
# UDP ingest (nginx-ipng-stats-plugin):
ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s;
}
```

Precise v1 layout — 13 tab-separated fields total (version prefix + 12 payload fields):
| # | Field | Ingested into |
|---|---------------------------|---------------------------------------------------|
| 0 | `v2` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$bytes_sent` | Prom `nginx_http_bytes_sent` |
| 7 | `$request_length` | Prom `nginx_http_request_bytes` |
| 8 | `$request_time` | Prom `nginx_http_request_duration_seconds` |
| 9 | `$upstream_response_time` | Prom `nginx_http_upstream_duration_seconds` |
| 10| `$upstream_status` | Prom `nginx_http_upstream_requests_total` |
| 11| `$is_tor` | `is_tor` |
| 12| `$asn` | `asn` |
| 13| `$ipng_source_tag` | `source_tag` |
| 14| `$server_addr` | *(parsed and discarded)* |
| 15| `$scheme` | *(parsed and discarded)* |

| # | Field | Ingested into |
|---|-------------------|------------------------------|
| 0 | `v1` | version tag |
| 1 | `$host` | `website` |
| 2 | `$remote_addr` | `client_prefix` (truncated) |
| 3 | `$request_method` | Prom `method` label |
| 4 | `$request_uri` | `http_request_uri` (query stripped) |
| 5 | `$status` | `http_response` |
| 6 | `$body_bytes_sent`| Prom body histogram |
| 7 | `$request_time` | Prom duration histogram |
| 8 | `$is_tor` | `is_tor` |
| 9 | `$asn` | `asn` |
| 10| `$ipng_source_tag`| `source_tag` |
| 11| `$server_addr` | *(parsed and discarded)* |
| 12| `$scheme` | *(parsed and discarded)* |
For requests served without an upstream (static files, redirects, errors), nginx emits
literal `-` for `$upstream_response_time` and `$upstream_status`; the parser treats those as
"no upstream" and skips the upstream metrics rather than counting them as zeros. When nginx
retries across multiple upstreams, both fields are comma-separated and the parser keeps the
last value (the upstream that ultimately served the response).
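
The two upstream rules above (a literal `-` means "no upstream", and a comma-separated list keeps its last element) might look roughly like this in Go. The helper names `lastUpstreamValue` and `upstreamSeconds` are hypothetical and are not taken from the collector's source; the sketch handles only the comma-separated case described in this guide.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lastUpstreamValue is a hypothetical helper. It returns the last
// comma-separated element of an nginx upstream variable, and false when
// that element is "-" or empty, meaning no upstream served the request.
func lastUpstreamValue(raw string) (string, bool) {
	parts := strings.Split(raw, ",")
	last := strings.TrimSpace(parts[len(parts)-1])
	if last == "" || last == "-" {
		return "", false
	}
	return last, true
}

// upstreamSeconds parses $upstream_response_time. The boolean is false
// when the upstream metrics should be skipped rather than recorded as zero.
func upstreamSeconds(raw string) (float64, bool) {
	v, ok := lastUpstreamValue(raw)
	if !ok {
		return 0, false
	}
	secs, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return 0, false
	}
	return secs, true
}

func main() {
	for _, raw := range []string{"-", "0.120", "0.004, 0.250"} {
		secs, ok := upstreamSeconds(raw)
		fmt.Printf("%q -> seconds=%v observed=%v\n", raw, secs, ok)
	}
}
```

Returning a separate boolean instead of a zero value keeps "no upstream" distinguishable from a genuinely instantaneous upstream response, which is the distinction the paragraph above draws.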

Compared to the file format: the version tag is added, `$msec` is dropped, and three fields
are appended — `$ipng_source_tag` (propagated into the data model), `$server_addr` and
`$scheme` (reserved for future use).
#### v1 (legacy)

**Unknown `$is_tor` / `$asn`: emit `0`.** Same convention as the file format — operators
without TOR or GeoIP data can emit `0` for both columns and everything works. A literal `0`
in `$is_tor` is `false`; a literal `0` in `$asn` is ASN `0`, filterable at query time.
v1 is preserved unchanged so existing emitters can be upgraded after the collector. Layout:

All 13 fields are required for v1 — malformed packets (wrong version, wrong field count, bad
IP) are silently dropped and counted via `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`. Both paths (file + UDP) can feed the same collector
simultaneously; they converge on the same aggregation pipeline.
```nginx
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status\t'
'$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag\t$server_addr\t$scheme';
```

12 tab-separated payload fields after the `v1` prefix. v1 fills `nginx_http_bytes_sent` from
`$body_bytes_sent`; v2 fills it from `$bytes_sent`. Operators will see a small step up in
that metric (header overhead, typically a few hundred bytes per response) when emitters move
to v2.

#### Required values

`$is_tor` is `1` if the client IP is a TOR exit node and `0` otherwise (typically populated
via a Lua script or `$geoip2_data_*`). `$asn` is the client AS number as a decimal integer
(e.g. MaxMind GeoIP2's `$geoip2_data_autonomous_system_number`). Operators without TOR or
GeoIP data MUST emit literal `0` for both — a literal `0` in `$is_tor` parses as `false`; a
literal `0` in `$asn` is ASN `0`, filterable at query time with `--asn '!=0'`.

`$ipng_source_tag` is provided by [`nginx-ipng-stats-plugin`](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin).
Operators not running the plugin SHOULD declare a constant via `set $ipng_source_tag direct;`
in their `server` block — there is no synthesised fallback in the collector.

#### Pointing the collector at logs

For file ingest, set `COLLECTOR_LOGS` to comma-separated paths or glob patterns. Make sure
the files are group-readable by `www-data` (the collector's primary group in the systemd
unit). For UDP ingest, the plugin's `ipng_stats_logtail udp://127.0.0.1:9514` line above is
sufficient. Both paths can feed the same collector simultaneously and converge on the same
aggregation pipeline. Malformed lines (wrong version, wrong field count, bad IP) are silently
dropped; for UDP they show up as `logtail_udp_packets_received_total` minus
`logtail_udp_loglines_success_total`.

---

@@ -303,21 +300,33 @@ the new file appears. No restart or SIGHUP required.
The collector exposes a Prometheus-compatible `/metrics` endpoint on `--prom-listen` (default
`:9100`). Set `--prom-listen ""` to disable it entirely.

**Per-host series:**
**Per-{host,source_tag} series** (both v1 and v2):

- `nginx_http_requests_total{host, method, status}` — counter. Map capped at 250 000 distinct
label sets; new entries beyond the cap are dropped until the map is rolled over.
- `nginx_http_response_body_bytes_{bucket,count,sum}{host, le}` — histogram of
`$body_bytes_sent`. Buckets (bytes): `256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, le}` — histogram of
`$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5,
10, +Inf`. Not split by `source_tag` (duration histogram stays per-host to avoid cardinality
blow-up).
- `nginx_http_bytes_sent_{bucket,count,sum}{host, source_tag, le}` — histogram of response
size. v1 fills from `$body_bytes_sent`; v2 fills from `$bytes_sent`. Buckets (bytes):
`256, 1024, 4096, 16384, 65536, 262144, 1048576, +Inf`.
- `nginx_http_request_duration_seconds_{bucket,count,sum}{host, source_tag, le}` — histogram
of `$request_time`. Buckets (seconds): `0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5,
5, 10, +Inf`.

**Per-`source_tag` roll-ups** (parallel series, not a cross-product with `host`):
**v2-only series** (populated only when v2 emitters are running, and the upstream histograms
only when nginx involved an upstream):

- `nginx_http_requests_by_source_total{source_tag}` — counter.
- `nginx_http_response_body_bytes_by_source_{bucket,count,sum}{source_tag, le}` — histogram.
- `nginx_http_request_bytes_{bucket,count,sum}{host, source_tag, le}` — histogram of
`$request_length` (ingress, headers + body). Same byte buckets as `bytes_sent`.
- `nginx_http_upstream_duration_seconds_{bucket,count,sum}{host, source_tag, le}` —
histogram of `$upstream_response_time`. Same time buckets as `request_duration`. Lets you
split end-to-end latency into upstream vs. nginx overhead.
- `nginx_http_upstream_requests_total{host, source_tag, status_class}` — counter incremented
once per upstream-served request, classed by `$upstream_status` (`2xx`/`3xx`/`4xx`/`5xx`/`other`).
Lets you spot upstream errors masked at the edge (e.g. nginx 502 because origin 504).

**Source-tag rollup** (fleet-wide attribution health, intentionally not crossed with host):

- `nginx_http_requests_by_source_total{source_tag, status_class}` — counter classed by
`$status`. Use it to spot per-source error spikes without exploding cardinality.
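
The `status_class` label used by the two counters above collapses a status code into one of five values. A minimal Go sketch of such a mapping follows; the function name `statusClass` is hypothetical, and placing 1xx codes and non-numeric values in `other` is an assumption inferred from the label list (`2xx`/`3xx`/`4xx`/`5xx`/`other`), not something confirmed from the collector's source.

```go
package main

import "fmt"

// statusClass is a hypothetical helper that maps an HTTP status string
// to the five label values named in this guide.
func statusClass(status string) string {
	if len(status) != 3 {
		return "other"
	}
	for i := 0; i < len(status); i++ {
		if status[i] < '0' || status[i] > '9' {
			return "other"
		}
	}
	switch status[0] {
	case '2':
		return "2xx"
	case '3':
		return "3xx"
	case '4':
		return "4xx"
	case '5':
		return "5xx"
	default:
		// Assumption: 1xx and anything outside 2xx-5xx is "other".
		return "other"
	}
}

func main() {
	for _, s := range []string{"200", "301", "404", "502", "101", "-"} {
		fmt.Printf("%s -> %s\n", s, statusClass(s))
	}
}
```

Bounding the label to five values keeps the series count per `source_tag` fixed, which is what lets the rollup stay cheap while still showing error spikes.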

**UDP ingest counters** — lets operators distinguish parse failures from back-pressure drops:

@@ -365,10 +374,23 @@ histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)

# Median response body size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_response_body_bytes_bucket[5m]))
# 95th percentile response time per source_tag (drill in further as needed)
histogram_quantile(0.95,
sum by (source_tag, le) (rate(nginx_http_request_duration_seconds_bucket[5m]))
)

# Median response size per host
histogram_quantile(0.50,
sum by (host, le) (rate(nginx_http_bytes_sent_bucket[5m]))
)

# v2-only: upstream P95, split out from nginx overhead
histogram_quantile(0.95,
sum by (host, le) (rate(nginx_http_upstream_duration_seconds_bucket[5m]))
)

# v2-only: upstream 5xx rate per source_tag
sum by (source_tag) (rate(nginx_http_upstream_requests_total{status_class="5xx"}[5m]))
```

### Memory usage
