Reduce scrape cardinality: class codes, per-(source,vip) histograms, byte histograms

Collapses the status-code dimension of the counter key into six class
lanes (1xx..5xx/unknown) so per-(source,vip) counter cardinality no
longer grows with the number of distinct three-digit responses nginx
serves. Histogram series drop the code label entirely and aggregate
across classes. Adds nginx_ipng_latency_total with a code class label
so average latency per class can still be computed off the scrape.
Adds nginx_ipng_bytes_{in,out} histograms with configurable boundaries
via the new ipng_stats_byte_buckets directive. Bumps JSON schema to 2.
Operators who need full three-digit-code resolution should consume the
ipng_stats_logtail stream off-host; the stats zone intentionally trades
that resolution for a bounded scrape size.
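
The class-lane collapse described above can be pictured with a short Python sketch. This is not the module's code (the module is an nginx module, and its implementation is not shown in this commit view); the function name `status_class` and the exact range boundaries are assumptions made for illustration.

```python
def status_class(code: int) -> str:
    """Map a three-digit HTTP status code to one of six class lanes.

    Hypothetical helper: assumes codes 100..599 map to "1xx".."5xx"
    and everything else falls into the "unknown" lane.
    """
    if 100 <= code <= 599:
        return f"{code // 100}xx"
    return "unknown"


# However many distinct codes are served, the label takes at most six values.
lanes = {status_class(code) for code in range(0, 1000)}
```

With this mapping, the `code` label contributes at most six series per (source, vip) pair, which is the bound the commit message refers to.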
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -183,16 +183,20 @@ curl -s http://127.0.0.1:9113/.well-known/ipng/statsz
 Default output is Prometheus text format:
 
 ```
-# HELP nginx_ipng_requests_total Total HTTP requests, per (source_tag, vip, code).
+# HELP nginx_ipng_requests_total Total HTTP requests.
 # TYPE nginx_ipng_requests_total counter
-nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="200"} 12345
-nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="404"} 17
-nginx_ipng_requests_total{source_tag="mg2",vip="192.0.2.10",code="200"} 9876
-nginx_ipng_requests_total{source_tag="direct",vip="192.0.2.10",code="200"} 42
-# HELP nginx_ipng_bytes_in_total Request bytes received, per (source_tag, vip, code).
+nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 12345
+nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="4xx"} 17
+nginx_ipng_requests_total{source_tag="mg2",vip="192.0.2.10",code="2xx"} 9876
+nginx_ipng_requests_total{source_tag="direct",vip="192.0.2.10",code="2xx"} 42
+# HELP nginx_ipng_bytes_in_total Request bytes received.
 # TYPE nginx_ipng_bytes_in_total counter
-nginx_ipng_bytes_in_total{source_tag="mg1",vip="192.0.2.10",code="200"} 9876543
+nginx_ipng_bytes_in_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 9876543
 # ... and so on
+
+# Histogram series (request_duration, upstream_response, bytes_in, bytes_out)
+# do NOT carry a `code` label; they aggregate across classes per (source, vip).
+nginx_ipng_request_duration_seconds_bucket{source_tag="mg1",vip="192.0.2.10",le="0.050"} 11200
 ```
 
 For JSON output instead, set the `Accept` header:
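
To show what a consumer can do with the new class-labelled output, here is a minimal Python sketch that parses the sample lines from the hunk above and sums requests per (vip, class) across sources. The parser is deliberately naive (it does not handle escaped quotes or braces inside label values), and the helper names `parse_samples` and `requests_by_class` are made up for this example; a production consumer would more likely use an existing Prometheus client library.

```python
import re
from collections import defaultdict

# Sample taken from the post-change documentation in this commit.
SAMPLE = """\
# HELP nginx_ipng_requests_total Total HTTP requests.
# TYPE nginx_ipng_requests_total counter
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 12345
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="4xx"} 17
nginx_ipng_requests_total{source_tag="mg2",vip="192.0.2.10",code="2xx"} 9876
nginx_ipng_requests_total{source_tag="direct",vip="192.0.2.10",code="2xx"} 42
nginx_ipng_request_duration_seconds_bucket{source_tag="mg1",vip="192.0.2.10",le="0.050"} 11200
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match["labels"]))
            yield match["name"], labels, float(match["value"])


def requests_by_class(text):
    """Sum nginx_ipng_requests_total per (vip, code class) across all sources."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "nginx_ipng_requests_total":
            totals[(labels["vip"], labels["code"])] += value
    return dict(totals)


totals = requests_by_class(SAMPLE)
```

For the sample above, `totals` holds two keys for the VIP, one for `2xx` and one for `4xx`; the histogram line is parsed but ignored because it belongs to a different metric family.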
@@ -237,7 +241,7 @@ Typical PromQL queries:
 sum by (source_tag, vip) (rate(nginx_ipng_requests_total[1m]))
 
 # 5xx error rate per VIP, aggregated across all sources:
-sum by (vip) (rate(nginx_ipng_requests_total{code=~"5.."}[5m]))
+sum by (vip) (rate(nginx_ipng_requests_total{code="5xx"}[5m]))
 /
 sum by (vip) (rate(nginx_ipng_requests_total[5m]))
 
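
For consumers that do not run PromQL, the same 5xx ratio can be approximated from two consecutive scrapes. The following Python sketch assumes the caller has already summed `nginx_ipng_requests_total` per class for one VIP at each scrape; the function name `error_ratio` and the dict layout are illustrative and not part of the module.

```python
def error_ratio(prev, curr):
    """Return the 5xx share of requests between two scrapes, or None.

    prev and curr map a class label ("2xx", "5xx", ...) to the cumulative
    counter value for a single VIP. None is returned when no requests were
    seen or when a counter went backwards (for example after a restart).
    """
    deltas = {cls: curr[cls] - prev.get(cls, 0) for cls in curr}
    if any(delta < 0 for delta in deltas.values()):
        return None  # counter reset; skip this interval
    total = sum(deltas.values())
    if total == 0:
        return None
    return deltas.get("5xx", 0) / total


ratio = error_ratio({"2xx": 100, "5xx": 1}, {"2xx": 190, "5xx": 11})
```

Here 90 new `2xx` and 10 new `5xx` requests give a ratio of 0.1 for the interval.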
@@ -252,6 +256,11 @@ Operators who want a single unified access log covering all traffic — regardle
 have to repeat `access_log` in every `server {}` block or rely on a catch-all virtual host. The `ipng_stats_logtail` directive removes
 that requirement: one line at the `http` level registers a global log-phase writer that fires unconditionally for every request (FR-8.1).
 
+The logtail is also the recommended escape hatch when you need richer cardinality than the stats zone exposes. The Prometheus counters
+deliberately collapse HTTP status codes into six class lanes (`1xx`..`5xx`/`unknown`) to keep scrape size bounded. Operators who need
+per-three-digit-code, per-path, per-user-agent, or any other high-cardinality breakdown should ship the logtail stream to an off-path
+analytics receiver and compute those views there; that work happens in a different process and never touches the nginx hot path.
+
 The logtail sends each buffer flush as a single UDP datagram to a `host:port`. Zero disk I/O, no backpressure, no blocking if the
 receiver is down. This makes it ideal for fire-and-forget analytics pipelines where delivery guarantees are unnecessary and disk writes
 would add unwanted I/O pressure. For file-based access logging, use nginx's built-in `access_log` directive.
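
An off-path receiver for the logtail stream can be very small. The Python sketch below binds a UDP socket and splits one datagram into records. The record format is not described in this hunk, so the sketch assumes newline-separated text records inside each datagram; that assumption, along with the helper names `open_receiver` and `read_flush`, is hypothetical and should be checked against the logtail format documentation.

```python
import socket


def open_receiver(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 asks the OS for a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_flush(sock, bufsize=65535):
    """Receive one datagram (one buffer flush) and return its records.

    Assumption: records are newline-separated text. Undecodable bytes are
    replaced rather than raising, since a lossy pipeline should not crash
    on a single malformed record.
    """
    data, _addr = sock.recvfrom(bufsize)
    text = data.decode("utf-8", errors="replace")
    return [record for record in text.split("\n") if record]
```

A real receiver would call `read_flush` in a loop and aggregate per code, path, or user agent in its own process. Because UDP gives no delivery guarantee, any counts derived this way are approximate.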
@@ -374,9 +383,10 @@ from any language.
 Once wired, a consumer can derive from the scrape data:
 
 - Live QPS per backend (from the EWMA gauges).
-- Status-code mix per backend (from the counter families).
-- p50/p95 latency per backend (from the duration histogram).
-- Traffic volume per backend (from the bytes counters).
+- Status-class mix per backend (the six-lane `1xx`..`5xx`/`unknown` counter families). Full three-digit codes are not exported by the
+  scrape endpoint; route the logtail stream off-host and aggregate there if you need per-code breakdowns.
+- p50/p95 latency per backend (from the duration histogram, aggregated across classes).
+- Traffic volume per backend (from the bytes counters and the new bytes histograms).
 
 For an example of this pattern in a GRE tunnel fleet, see [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), whose frontend scrapes
 each nginx backend filtered by source tag to show per-backend traffic alongside health state.
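
The p50/p95 item in the list above relies on estimating a quantile from cumulative histogram buckets. A minimal Python sketch of that estimate follows; it uses linear interpolation inside the matching bucket, the same general idea as PromQL's `histogram_quantile`, but it is not a copy of that implementation. The function name `estimate_quantile` and the input layout are assumptions for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket,
    as in a Prometheus `_bucket` series with `le` labels. The lowest
    bucket is assumed to start at 0. Returns None for an empty histogram.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total <= 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            in_bucket = count - prev_count
            if in_bucket <= 0:
                return bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative bucket counts, not taken from a real scrape.
example = [(0.05, 50), (0.1, 90), (0.5, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, example)
p95 = estimate_quantile(0.95, example)
```

In practice the inputs would be per-interval bucket deltas (or rates) rather than lifetime cumulative counts, so that the estimate reflects recent traffic. Since the histograms no longer carry a `code` label, the result covers all status classes together.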
@@ -393,7 +403,8 @@ values in `listens.conf`, or the interfaces aren't up. Run `ip -br link` and con
 `nginx.conf` is stable across reloads — renaming the zone forces a new shared-memory segment.
 
 **`nginx_ipng_zone_full_events_total` is non-zero.** The shared-memory zone is too small for your VIP count. Increase the size in
-`ipng_stats_zone ipng:<size>` (default 4 MB is enough for ~hundreds of VIPs with the full status-code set).
+`ipng_stats_zone ipng:<size>` (default 4 MB is enough for ~hundreds of VIPs; the code dimension is bucketed to six classes, so
+one 4 MB zone holds a very large deployment).
 
 **`curl http://127.0.0.1:9113/.well-known/ipng/statsz` returns "403 Forbidden".** The `allow`/`deny` ACL is blocking your source address. Either add
 yourself or scrape from a host already in the allow list.