Reduce scrape cardinality: class codes, per-(source,vip) histograms, byte histograms

Collapses the status-code dimension of the counter key into six class
lanes (1xx..5xx/unknown) so per-(source,vip) counter cardinality no
longer grows with the number of distinct three-digit responses nginx
serves. Histogram series drop the code label entirely and aggregate
across classes. Adds nginx_ipng_latency_total with a code class label
so average latency per class can still be computed off the scrape.
Adds nginx_ipng_bytes_{in,out} histograms with configurable boundaries
via the new ipng_stats_byte_buckets directive. Bumps JSON schema to 2.

Operators who need full three-digit-code resolution should consume the
ipng_stats_logtail stream off-host; the stats zone intentionally trades
that resolution for a bounded scrape size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:36:16 +02:00
parent 87050bcf13
commit b3ad74cbde
5 changed files with 817 additions and 312 deletions


@@ -108,6 +108,21 @@ same set applies to every `(source, vip)` key in the module (v0.1 does not suppo
See FR-2.3, FR-5.4.
### `ipng_stats_byte_buckets <size> <size> ...`
**Context:** `http`.
**Value:** two or more strictly increasing sizes (nginx size spec: `100`, `1k`, `1m`, ...) representing byte-size histogram upper
bounds.
**Default:** `100 1000 10000 100000 1000000 10000000`, plus an implicit `+Inf` bucket.
**Effect:** overrides the default bucket boundaries for the `nginx_ipng_bytes_in` and `nginx_ipng_bytes_out` histograms. Pick values
that match your traffic mix — these bucket bounds feed the scrape output only, not the per-`(source, vip, class)` byte counters, which
are exact.
See FR-2.3.
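
For illustration, the histogram directives from this section might sit together like this in an `http {}` block. The sizes are examples, and the zone name `ipng` follows the `ipng_stats_zone ipng:<size>` form used elsewhere in this document:

```nginx
http {
    # Shared-memory zone for the counters (name:size).
    ipng_stats_zone ipng:4m;

    # Duration histogram bounds, in milliseconds (these are the defaults).
    ipng_stats_buckets 1 5 10 25 50 100 250 500 1000 2500 5000 10000;

    # Byte-size histogram bounds for the bytes_in/bytes_out histograms;
    # nginx size suffixes (k, m) are accepted. These are the defaults.
    ipng_stats_byte_buckets 100 1k 10k 100k 1m 10m;
}
```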
### `ipng_stats on | off`
**Context:** `http`, `server`, `location`.
@@ -231,17 +246,29 @@ See FR-3.1, FR-3.2, FR-3.3, FR-3.4, FR-3.5.
For Prometheus, the module exports under the `nginx_ipng_` prefix.
The `code` label is a class bucket — one of `1xx`, `2xx`, `3xx`, `4xx`, `5xx`, or `unknown` (for codes outside `[100, 599]`). This
keeps per-`(source, vip)` counter cardinality bounded at six lanes regardless of how many distinct three-digit responses nginx serves.
Histogram series do not carry `code` — they aggregate across all classes for a given `(source, vip)`. Operators who need a full
per-three-digit-code breakdown should enable `ipng_stats_logtail` and derive it from the access-log stream off the hot path.
| metric | type | labels | meaning |
| --- | --- | --- | --- |
| `nginx_ipng_requests_total` | counter | `source_tag`, `vip`, `code` | Request count per `(source, vip, class)`. |
| `nginx_ipng_bytes_in_total` | counter | `source_tag`, `vip`, `code` | Request bytes received (request line + headers + body). |
| `nginx_ipng_bytes_out_total` | counter | `source_tag`, `vip`, `code` | Response bytes sent (status line + headers + body). |
| `nginx_ipng_latency_total` | counter | `source_tag`, `vip`, `code` | Sum of request durations, in seconds. Divide by `_requests_total` for mean latency per class. |
| `nginx_ipng_request_duration_seconds_bucket` | histogram bucket | `source_tag`, `vip`, `le` | Request duration histogram, aggregated across classes. |
| `nginx_ipng_request_duration_seconds_sum` | histogram sum | `source_tag`, `vip` | Sum of observed durations in seconds. |
| `nginx_ipng_request_duration_seconds_count` | histogram count | `source_tag`, `vip` | Count of observations. |
| `nginx_ipng_upstream_response_seconds_bucket` | histogram bucket | `source_tag`, `vip`, `le` | Upstream response time histogram. |
| `nginx_ipng_upstream_response_seconds_sum` | histogram sum | `source_tag`, `vip` | Sum of observed upstream response times, in seconds. |
| `nginx_ipng_upstream_response_seconds_count` | histogram count | `source_tag`, `vip` | Count of observations. |
| `nginx_ipng_bytes_in_bucket` | histogram bucket | `source_tag`, `vip`, `le` | Request-size histogram (bytes). |
| `nginx_ipng_bytes_in_sum` | histogram sum | `source_tag`, `vip` | Sum of request bytes (equals `bytes_in_total` summed over classes). |
| `nginx_ipng_bytes_in_count` | histogram count | `source_tag`, `vip` | Observations. |
| `nginx_ipng_bytes_out_bucket` | histogram bucket | `source_tag`, `vip`, `le` | Response-size histogram (bytes). |
| `nginx_ipng_bytes_out_sum` | histogram sum | `source_tag`, `vip` | Sum of response bytes. |
| `nginx_ipng_bytes_out_count` | histogram count | `source_tag`, `vip` | Observations. |
| `nginx_ipng_rate_1s` | gauge | `source_tag`, `vip` | EWMA requests/sec, 1-second decay. |
| `nginx_ipng_rate_10s` | gauge | `source_tag`, `vip` | EWMA requests/sec, 10-second decay. |
| `nginx_ipng_rate_60s` | gauge | `source_tag`, `vip` | EWMA requests/sec, 60-second decay. |
@@ -258,39 +285,31 @@ See FR-2.*, FR-3.7.
```json
{
  "schema": 2,
  "records": [
    {
      "source_tag": "mg1",
      "vip": "192.0.2.10",
      "classes": {
        "2xx": { "requests": 12345, "bytes_in": 9876543, "bytes_out": 54321098,
                 "latency_ms": 87654, "upstream_latency_ms": 61234 },
        "4xx": { "requests": 17, "bytes_in": 2048, "bytes_out": 9216,
                 "latency_ms": 102, "upstream_latency_ms": 0 }
      },
      "request_duration_ms": {
        "sum": 87756, "count": 12362,
        "buckets": { "1": 10, "5": 40, "10": 120, "+Inf": 12362 }
      },
      "upstream_response_ms": { "sum": 61234, "count": 12345, "buckets": { "...": "..." } },
      "bytes_in": { "count": 12362, "buckets": { "100": 200, "1000": 9000, "+Inf": 12362 } },
      "bytes_out": { "count": 12362, "buckets": { "...": "..." } }
    }
  ]
}
```
The top-level `schema` field is versioned — breaking changes bump it, additive changes don't. Schema `2` collapses status codes to
class buckets and moves histograms out of the per-class records to a per-`(source, vip)` record. Consumers SHOULD check `schema`
before parsing.
See FR-3.6.
@@ -303,6 +322,7 @@ See FR-3.6.
| `ipng_stats_flush_interval` | ✅ | — | — | — |
| `ipng_stats_default_source` | ✅ | — | — | — |
| `ipng_stats_buckets` | ✅ | — | — | — |
| `ipng_stats_byte_buckets` | ✅ | — | — | — |
| `ipng_stats_logtail` | ✅ | — | — | — |
| `ipng_stats on\|off` | ✅ | ✅ | ✅ | — |
| `ipng_stats;` (handler) | — | — | ✅ | — |


@@ -90,14 +90,18 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
**FR-2 Counters**
- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_class)` tuple, the following counters: total requests,
total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes
including status line, headers, and body), and sum of request durations in milliseconds (exported as `nginx_ipng_latency_total`).
The module MUST additionally maintain, per `(source, vip)` pair (no `code` label), fixed-bucket histograms of request duration in
milliseconds and of request/response sizes in bytes.
- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream
response time in milliseconds, keyed by the same `(source, vip)` pair.
- **FR-2.3** The duration histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source,
vip)` key. The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit
`+Inf` bucket. Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level. The byte-size
histograms (request and response bodies) use independent bounds defaulting to `{100, 1000, 10000, 100000, 1000000, 10000000}` bytes;
`ipng_stats_byte_buckets` overrides them.
- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous
request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2),
not from the request path.
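
Since the bucket bounds are fixed and shared (FR-2.3), placing an observation reduces to a binary search over the bounds array — the `O(log B)` step NFR-2.1 refers to. A minimal sketch; `hist_lane` and the bounds-array name are hypothetical, not taken from the module source:

```c
#include <stddef.h>

/* Default duration bounds from FR-2.3; lane 12 is the implicit +Inf bucket. */
static const unsigned long long ipng_default_ms_bounds[12] = {
    1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000
};

/* Return the lane index for value v given n strictly increasing upper
 * bounds: the smallest i with v <= bounds[i], or n (+Inf) if none.
 * O(log n) comparisons, no allocations. */
static unsigned
hist_lane(const unsigned long long *bounds, unsigned n, unsigned long long v)
{
    unsigned lo = 0, hi = n;

    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;

        if (v <= bounds[mid]) {
            hi = mid;            /* v fits this bucket or an earlier one */
        } else {
            lo = mid + 1;
        }
    }
    return lo;                   /* 0..n inclusive */
}
```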
@@ -105,8 +109,11 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization;
link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key;
a VIP that listens on both 80 and 443 MUST be aggregated.
- **FR-2.6** The `status_code` dimension MUST be bucketed into a single class label: `1xx`, `2xx`, `3xx`, `4xx`, `5xx`, or `unknown` for
codes outside `[100, 599]`. This bounds per-`(source, vip)` cardinality to six lanes regardless of how many distinct three-digit
codes are observed. Operators who need a full per-code breakdown SHOULD enable `ipng_stats_logtail` (FR-8) and derive the per-code
view from the access-log stream off the hot path; the stats zone intentionally trades that resolution for a much smaller scrape
response.
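
The clamping rule is small enough to show inline. A plausible reconstruction of the class mapping from this spec — the module's actual `ipng_status_to_class` may differ:

```c
/* Class index per FR-2.6: 0 = "unknown" for codes outside [100, 599],
 * 1..5 for 1xx..5xx. Out-of-range codes clamp to 0 instead of
 * allocating a new counter key (NFR-3.3). */
static unsigned
ipng_status_to_class(unsigned status)
{
    if (status < 100 || status > 599) {
        return 0;
    }
    return status / 100;
}
```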
**FR-3 Scrape endpoint**
@@ -122,8 +129,10 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST version via a top-level `schema` field so that breaking
changes can be made additively without bricking existing consumers.
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with
`source_tag` and `vip`. Counter metrics (`nginx_ipng_requests_total`, `nginx_ipng_bytes_{in,out}_total`, `nginx_ipng_latency_total`)
additionally carry a `code` label with a class value (`1xx`..`5xx`/`unknown`). Histogram series (duration, upstream response,
request/response byte size) MUST NOT carry a `code` label — they aggregate across all classes for a given `(source, vip)` pair.

**FR-4 Hot path and flush**
@@ -220,7 +229,7 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
(cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and an
`O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct
`(source, vip, class)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations,
where `K_total` is the number of distinct keys in the zone.
@@ -231,9 +240,9 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
level no more than once per minute per worker.
- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT
accumulate private state that exceeds the shared-zone capacity.
- **NFR-3.3** Status codes are collapsed to six classes (`1xx`..`5xx`/`unknown`) at counter-update time (FR-2.6), bounding per-`(source,
vip)` counter cardinality at six lanes regardless of how many three-digit codes are observed. Any code outside `[100, 599]` falls
into `code="unknown"`. Per-code resolution is available via `ipng_stats_logtail` (FR-8), which operates off the hot path.

**NFR-4 Reload neutrality**
@@ -332,7 +341,7 @@ dynamic-module ABI.
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
- Update EWMAs at flush time (FR-2.4).
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
@@ -359,17 +368,22 @@ such an interface is silently misattributed to that interface's source tag. This
#### Counter Data Model
Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_class)` where
`source_id` and `vip_id` are small integers assigned at first observation and `status_class` is one of six values (`0=unknown`,
`1..5` for `1xx`..`5xx`). The value is a fixed-size record containing:
- `requests` (u64)
- `bytes_in` (u64)
- `bytes_out` (u64)
- `duration_sum_ms` (u64) — exported as `nginx_ipng_latency_total` (per class)
- `upstream_sum_ms` (u64)
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
- `upstream_hist` — same shape, only updated when an upstream served the request
- `bytes_in_hist`, `bytes_out_hist` — `Bb+1` u64 lanes over the byte-size bucket bounds
Histogram lanes are kept per `(source, vip, class)` in storage, then summed across classes at scrape time to produce one
`_bucket`/`_sum`/`_count` series per `(source, vip)` — the Prometheus exposition never carries a `code` label on histogram series
(FR-3.7).
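
That scrape-time fold can be sketched as follows — the names and the fixed lane count are illustrative; in the module the lane count follows the configured bucket set:

```c
#define NCLASS 6    /* unknown, 1xx..5xx */
#define NLANES 13   /* 12 bounds + the +Inf lane (illustrative) */

/* Sum per-class histogram lanes into the single per-(source, vip)
 * series emitted in the Prometheus exposition (FR-3.7). */
static void
fold_classes(const unsigned long long lanes[NCLASS][NLANES],
             unsigned long long out[NLANES])
{
    for (unsigned b = 0; b < NLANES; b++) {
        out[b] = 0;
        for (unsigned c = 0; c < NCLASS; c++) {
            out[b] += lanes[c][b];
        }
    }
}
```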
A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. EWMAs are floats but updated
only from the flush tick, so there is no float contention on the request path.
@@ -379,8 +393,9 @@ endpoint can recover the original strings without re-parsing configuration.
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of attributed interfaces (single digits). Because status codes are
collapsed to six classes (FR-2.6), the `status_class` dimension contributes at most 6× the `(source, vip)` count — a ~10× reduction
from the per-three-digit-code model considered and discarded.

#### Hot Path
@@ -393,7 +408,7 @@ ipng_stats_log_handler(ngx_http_request_t *r)
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          class_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
@@ -403,8 +418,8 @@ ipng_stats_log_handler(ngx_http_request_t *r)
    /* lctx contains source_id and the cached VIP id,
       or resolves VIP lazily on first seen address */
    class_idx = ipng_status_to_class(r->headers_out.status);  /* 0..5 */
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, class_idx);

    counter->requests++;
    counter->bytes_in += r->request_length;
@@ -425,7 +440,7 @@ ipng_stats_log_handler(ngx_http_request_t *r)
```
Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by
`(source_id, vip_id, class_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.
#### Flush Timer
@@ -459,9 +474,10 @@ fixed-size buffer per chain link and requests new links only when full.
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
query parameters.
- **Two new `listen` parameters**, `device=` and `ipng_source_tag=`, usable anywhere a `listen` directive is used.
- **Six new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
`ipng_stats_buckets`, `ipng_stats_byte_buckets`, `ipng_stats` (on/off).
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source_tag`, `vip`, and (for counter metrics) a `code` class label
(`1xx`..`5xx`/`unknown`).

**Consumes.**


@@ -183,16 +183,20 @@ curl -s http://127.0.0.1:9113/.well-known/ipng/statsz
Default output is Prometheus text format:
```
# HELP nginx_ipng_requests_total Total HTTP requests.
# TYPE nginx_ipng_requests_total counter
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 12345
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="4xx"} 17
nginx_ipng_requests_total{source_tag="mg2",vip="192.0.2.10",code="2xx"} 9876
nginx_ipng_requests_total{source_tag="direct",vip="192.0.2.10",code="2xx"} 42
# HELP nginx_ipng_bytes_in_total Request bytes received.
# TYPE nginx_ipng_bytes_in_total counter
nginx_ipng_bytes_in_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 9876543
# ... and so on
# Histogram series (request_duration, upstream_response, bytes_in, bytes_out)
# do NOT carry a `code` label — they aggregate across classes per (source, vip).
nginx_ipng_request_duration_seconds_bucket{source_tag="mg1",vip="192.0.2.10",le="0.050"} 11200
```

For JSON output instead, set the `Accept` header:
@@ -237,7 +241,7 @@ Typical PromQL queries:
sum by (source_tag, vip) (rate(nginx_ipng_requests_total[1m]))

# 5xx error rate per VIP, aggregated across all sources:
sum by (vip) (rate(nginx_ipng_requests_total{code="5xx"}[5m]))
/
sum by (vip) (rate(nginx_ipng_requests_total[5m]))
@@ -252,6 +256,11 @@ Operators who want a single unified access log covering all traffic — regardle
have to repeat `access_log` in every `server {}` block or rely on a catch-all virtual host. The `ipng_stats_logtail` directive removes
that requirement: one line at the `http` level registers a global log-phase writer that fires unconditionally for every request (FR-8.1).
The logtail is also the recommended escape hatch when you need richer cardinality than the stats zone exposes. The Prometheus counters
deliberately collapse HTTP status codes into six class lanes (`1xx`..`5xx`/`unknown`) to keep scrape size bounded. Operators who need
per-three-digit-code, per-path, per-user-agent, or any other high-cardinality breakdown should ship the logtail stream to an off-path
analytics receiver and compute those views there — that work happens in a different process and never touches the nginx hot path.
The logtail sends each buffer flush as a single UDP datagram to a `host:port`. Zero disk I/O, no backpressure, no blocking if the
receiver is down. This makes it ideal for fire-and-forget analytics pipelines where delivery guarantees are unnecessary and disk writes
would add unwanted I/O pressure. For file-based access logging, use nginx's built-in `access_log` directive.
@@ -374,9 +383,10 @@ from any language.
Once wired, a consumer can derive from the scrape data:
- Live QPS per backend (from the EWMA gauges).
- Status-class mix per backend (the six-lane `1xx`..`5xx`/`unknown` counter families). Full three-digit codes are not exported by the
scrape endpoint; route the logtail stream off-host and aggregate there if you need per-code breakdowns.
- p50/p95 latency per backend (from the duration histogram, aggregated across classes).
- Traffic volume per backend (from the bytes counters and the new bytes histograms).

For an example of this pattern in a GRE tunnel fleet, see [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), whose frontend scrapes
each nginx backend filtered by source tag to show per-backend traffic alongside health state.
@@ -393,7 +403,8 @@ values in `listens.conf`, or the interfaces aren't up. Run `ip -br link` and con
`nginx.conf` is stable across reloads — renaming the zone forces a new shared-memory segment.

**`nginx_ipng_zone_full_events_total` is non-zero.** The shared-memory zone is too small for your VIP count. Increase the size in
`ipng_stats_zone ipng:<size>` (the default 4 MB covers ~hundreds of VIPs; with the code dimension bucketed to six classes, one 4 MB
zone holds a very large deployment).

**`curl http://127.0.0.1:9113/.well-known/ipng/statsz` returns "403 Forbidden".** The `allow`/`deny` ACL is blocking your source address. Either add
yourself or scrape from a host already in the allow list.

File diff suppressed because it is too large


@@ -71,14 +71,14 @@ Direct traffic tagged
# --- Status code tracking ---
Per-class code counters
    [Documentation]    4xx and 2xx appear as class-bucketed code= labels.
    Docker Exec Ignore Rc    ${CLIENT1}    curl -s http://10.0.1.1:8080/notfound
    Docker Exec Ignore Rc    ${CLIENT1}    curl -s http://10.0.1.1:8080/notfound
    Wait For Flush
    ${output} =    Scrape With Filter    source_tag=cl1
    Should Contain    ${output}    code="4xx"
    Should Contain    ${output}    code="2xx"
# --- Duration histogram ---