Pim van Pelt 055cf9f830 Update docs and header comments for the IP_PKTINFO attribution model
The SO_BINDTODEVICE → IP_PKTINFO switch in the previous commit
was a semantic change: the module no longer touches outgoing
routing at all, and several places in the docs and the module's
top-of-file comment still described the old mechanism.

- README.md and debian/control now describe attribution as
  reading the ingress ifindex per connection from the kernel's
  IP_PKTINFO / IPV6_PKTINFO cmsg, and explicitly call out that
  the DSR / maglev return-path constraint is what makes the
  change necessary.
- docs/design.md FR-1.1 / FR-1.5 / FR-1.6 are rewritten to
  forbid SO_BINDTODEVICE and to describe the cmsg-based lookup.
  NFR-6.1 notes these are ordinary unprivileged socket options.
  The "Components" / "Composes With" sections and the
  "Alternatives Considered" entry are brought in line — and a
  new entry records SO_BINDTODEVICE as a rejected alternative
  with the exact failure mode seen on an IPng production box.
- docs/config-guide.md already carried the new description;
  unchanged here.
- src/ngx_http_ipng_stats_module.c's top-level block comment is
  rewritten to match; the section header above init_module goes
  from "rebind listen sockets with SO_BINDTODEVICE" to "enable
  IP_PKTINFO on listen sockets, resolve ifindexes".

Three SO_BINDTODEVICE mentions deliberately remain in the source
and one in the design doc's alternatives table — all of them
explain that the module *avoids* the option, which is itself
load-bearing documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:16:10 +02:00

<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-ipng-stats-plugin Design Document
## Metadata
| | |
| --- | --- |
| **Status** | Draft — describes intended behavior for `v0.2.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-16 |
| **Audience** | Operators and contributors deploying per-device, per-VIP traffic observability on nginx |
The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase — "can", "will", "does" — and should not be read as normative.
## Summary
`nginx-ipng-stats-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module
records per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific interface on which each
connection arrived. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that Prometheus, custom
dashboards, and ad-hoc `curl` sessions can all read the same data.
## Background
Any deployment where traffic arrives on distinct Linux interfaces — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet — can
benefit from per-interface traffic visibility. The nginx instances that serve the traffic already observe everything an operator wants to
see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small in-process
module emits those numbers on an HTTP endpoint, and consumers scrape the data filtered by source tag.
One motivating use case is [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), where each load-balancer instance terminates a GRE
tunnel on the nginx host. The module attributes traffic per tunnel, letting the frontend show per-backend counters that VPP's fast path
cannot provide. But the module is not coupled to that use case — it works with any interface type and any consumer.
## Goals and Non-Goals
### Product Goals
1. **Per-VIP, per-device traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and
out, and request-duration histograms, split by which interface delivered the traffic.
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
using a custom consumer can both consume the module without extra configuration.
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
stock upstream nginx without recompiling nginx itself.
5. **Composable with normal nginx use.** A host running the module with device-tagged listeners **and** serving unrelated direct web
traffic on the same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive;
it only adds new parameters and directives that are no-ops when unused.
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
of view.
### Non-Goals
- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
`nginx-lua-prometheus`. Its metric set is deliberately narrow: per-VIP, per-device counters, histograms, and rate gauges.
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
- The module does **not** talk to any external daemon. It does not initiate gRPC or read any external configuration. The attribution tag
it emits is a string supplied by the operator in the `listen` directive; nothing more.
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
access logs and existing log-analysis tools.
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
restart they start at zero. Consumers who need historical retention SHOULD have Prometheus scrape the endpoint and query the history there.
- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges
are the operator's responsibility.
## Requirements
Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.
### Functional Requirements
**FR-1 Attribution**
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which records a mapping
between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the
kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module
MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev
deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without
`device=` MUST create a plain listening socket as stock nginx does.
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
`ipng_stats_default_source`, FR-5.3). The built-in default is the literal string `direct`.
- **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist. Since no `SO_BINDTODEVICE`
is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the module distinguishes them by
reading `ifindex` from the per-connection cmsg — so "multiple device-tagged listens at the same port" at the config level
collapses to one kernel socket at runtime without any userspace contortions.
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST attribute
every connection whose ingress interface is `<ifname>` — regardless of which local address the client addressed — to that
listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3).
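
The tag-resolution rules in FR-1.3, FR-1.4, and FR-1.6 can be read as a small lookup. The following is a minimal sketch, not the module's code: the struct, the function names, and the linear scan are assumptions made for illustration, and the ifindex is taken as already extracted from the cmsg.

```c
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* One row per `listen ... device=` line. All names here are hypothetical. */
struct ipng_binding {
    int          family;   /* AF_INET or AF_INET6 */
    unsigned     ifindex;  /* resolved from the device name at init time */
    const char  *device;   /* value of device= */
    const char  *tag;      /* value of ipng_source_tag=, or NULL if unset */
};

/* FR-1.4: a binding without an explicit tag is tagged with its device name. */
const char *ipng_binding_tag(const struct ipng_binding *b)
{
    return (b->tag != NULL && b->tag[0] != '\0') ? b->tag : b->device;
}

/* FR-1.6 and FR-1.3: match on (family, ingress ifindex), otherwise fall
 * back to the configured default source. */
const char *ipng_resolve_tag(const struct ipng_binding *tbl, size_t n,
                             int family, unsigned ifindex,
                             const char *default_source)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].family == family && tbl[i].ifindex == ifindex) {
            return ipng_binding_tag(&tbl[i]);
        }
    }
    return default_source;
}
```

A linear scan is enough here because the number of attributed interfaces is expected to stay in the single digits.
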
**FR-2 Counters**
- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_class)` tuple, the following counters: total requests,
total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes
including status line, headers, and body), and sum of request durations in milliseconds (exported as `nginx_ipng_latency_total`).
The module MUST additionally maintain, per `(source, vip)` pair (no `code` label), fixed-bucket histograms of request duration in
milliseconds and of request/response sizes in bytes.
- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream
response time in milliseconds, keyed by the same `(source, vip)` pair.
- **FR-2.3** The duration histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source,
vip)` key. The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit
`+Inf` bucket. Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level. The byte-size
histograms (request and response bodies) use independent bounds defaulting to `{100, 1000, 10000, 100000, 1000000, 10000000}` bytes;
`ipng_stats_byte_buckets` overrides them.
- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous
request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2),
not from the request path.
- **FR-2.5** The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for
IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization;
link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key;
a VIP that listens on both 80 and 443 MUST be aggregated.
- **FR-2.6** The `status_code` dimension MUST be bucketed into a single class label: `1xx`, `2xx`, `3xx`, `4xx`, `5xx`, or `unknown` for
codes outside `[100, 599]`. This bounds per-`(source, vip)` cardinality to six lanes regardless of how many distinct three-digit
codes are observed. Operators who need a full per-code breakdown SHOULD enable `ipng_stats_logtail` (FR-8) and derive the per-code
view from the access-log stream off the hot path; the stats zone intentionally trades that resolution for a much smaller scrape
response.
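
The canonicalization in FR-2.5 can be sketched with the standard `inet_pton` / `inet_ntop` pair. This is an illustration under assumptions: the function name is hypothetical, and it starts from a textual address, whereas a module would more likely start from the connection's socket address. On glibc, `inet_ntop` emits lowercase, zero-compressed IPv6 text.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the canonical form of `in` to `out`.
 * Returns 0 on success, -1 if `in` is not an IPv4 or IPv6 literal. */
int ipng_canonical_vip(const char *in, char *out, size_t outlen)
{
    char            tmp[INET6_ADDRSTRLEN];
    struct in6_addr addr;                  /* large enough for both families */
    size_t          n = strcspn(in, "%");  /* length before any zone id */

    if (n >= sizeof(tmp)) {
        return -1;
    }
    memcpy(tmp, in, n);
    tmp[n] = '\0';                         /* zone identifier is dropped here */

    if (inet_pton(AF_INET, tmp, &addr) == 1) {
        return inet_ntop(AF_INET, &addr, out, (socklen_t) outlen) ? 0 : -1;
    }
    if (inet_pton(AF_INET6, tmp, &addr) == 1) {
        return inet_ntop(AF_INET6, &addr, out, (socklen_t) outlen) ? 0 : -1;
    }
    return -1;
}
```

With this sketch, `2001:DB8:0:0:0:0:0:1%eth0` and `2001:db8::1` map to the same key, so they aggregate into one `vip` label.
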
**FR-3 Scrape endpoint**
- **FR-3.1** The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that
location to serve the module's counters. The directive MUST NOT be combinable with other content handlers in the same location.
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
  - `Accept: application/json` → JSON output.
  - `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
- **FR-3.3** The handler MUST support a `source_tag=<tag>` query parameter that filters the output to only counters whose source dimension
equals the supplied tag. The comparison is exact-match and case-sensitive.
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST carry a version in a top-level `schema` field so that
a breaking change bumps the version and existing consumers can detect it rather than break silently.
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with
`source_tag` and `vip`. Counter metrics (`nginx_ipng_requests_total`, `nginx_ipng_bytes_{in,out}_total`, `nginx_ipng_latency_total`)
additionally carry a `code` label with a class value (`1xx`..`5xx`/`unknown`). Histogram series (duration, upstream response,
request/response byte size) MUST NOT carry a `code` label — they aggregate across all classes for a given `(source, vip)` pair.
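
To make FR-3.7 concrete, the sketch below renders one counter sample in Prometheus text exposition format. The metric and label names come from the requirement. The function name, the label order, and the choice to reject (rather than escape) label values containing quotes, backslashes, or newlines are assumptions of this example.

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Label values with these characters would need escaping; this sketch
 * simply refuses them. */
static int ipng_label_ok(const char *s)
{
    return strpbrk(s, "\"\\\n") == NULL;
}

/* Hypothetical formatter: returns the number of bytes written, or -1 if a
 * label is unsafe or the line does not fit in `buf`. */
int ipng_format_counter(char *buf, size_t len, const char *metric,
                        const char *source_tag, const char *vip,
                        const char *code, uint64_t value)
{
    int n;

    if (!ipng_label_ok(source_tag) || !ipng_label_ok(vip)
        || !ipng_label_ok(code))
    {
        return -1;
    }
    n = snprintf(buf, len,
                 "%s{source_tag=\"%s\",vip=\"%s\",code=\"%s\"} %" PRIu64 "\n",
                 metric, source_tag, vip, code, value);
    return (n < 0 || (size_t) n >= len) ? -1 : n;
}
```

A histogram series would use the same shape without the `code` label, in line with the last sentence of FR-3.7.
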
**FR-4 Hot path and flush**
- **FR-4.1** Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter
table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path.
- **FR-4.2** Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the
shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive.
- **FR-4.3** The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables.
- **FR-4.4** Histogram updates MUST be branch-light: the module MUST precompute a small lookup that maps elapsed milliseconds to a bucket
index using binary search over the fixed boundary array, and MUST NOT scan the array linearly.
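
The binary search required by FR-4.4 is a lower-bound search over the boundary array. The sketch below uses the default duration bounds from FR-2.3; the function and array names are hypothetical, and it assumes each bound is inclusive (the Prometheus `le` convention), which the requirements do not state explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Default duration bounds in milliseconds, as listed in FR-2.3. */
const uint64_t ipng_default_ms_bounds[12] = {
    1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000
};

/* Return the index of the first bound >= v, or n for the implicit +Inf
 * bucket. Runs in O(log n) with one comparison per step. */
size_t ipng_bucket_index(const uint64_t *bounds, size_t n, uint64_t v)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] < v) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

With the default bounds, a 100 ms request lands in the bucket bounded by `100`, a 101 ms request in the bucket bounded by `250`, and anything above 10000 ms in `+Inf`.
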
**FR-5 Directives**
- **FR-5.1** `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no
default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
nor `ipng_source_tag=`. Default `direct`.
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
strictly increasing positive integers.
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
`http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module
automatically sets `off` for the scrape location).
**FR-6 Variables**
- **FR-6.1** The module MUST register an nginx variable `$ipng_source_tag` that resolves to the source tag attributed to the current
connection. For connections that arrive on a `device=` interface this is the `ipng_source_tag=` value (or the `device=` name if
`ipng_source_tag=` was not set); for all other connections this is the value of `ipng_stats_default_source`. The variable is
usable in `log_format`, `map`, `add_header`, `if`, and any other nginx context that accepts variables.
- **FR-6.2** `$ipng_source_tag` MUST be available unconditionally when the module is loaded, even if `ipng_stats_zone` is not
configured. It does not depend on the counter subsystem; it only depends on the listen-parameter parsing. Operators who need the VIP
address should use nginx's built-in `$server_addr` variable.
**FR-7 Packaging**
- **FR-7.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
nginx itself.
- **FR-7.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
convention used by existing third-party nginx modules packaged for Debian.
- **FR-7.3** The package MUST install:
  - `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
  - `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
  - A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
    package's postinst.
- **FR-7.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
`modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
**FR-8 Logtail**
- **FR-8.1** The module MUST support an `ipng_stats_logtail <format_name> udp://host:port [buffer=<size>] [flush=<duration>] [if=$var]`
directive at the `http` level that registers a global log-phase writer which fires for every request (unless suppressed by `if=`),
regardless of which `server` or `location` block handled it. One directive at the `http` level is sufficient to cover all vhosts —
operators MUST NOT be required to repeat an `access_log` directive in every `server` block to achieve a single global access log.
- **FR-8.2** The `<format_name>` argument MUST be the name of an existing nginx `log_format` defined in the same `http` block before
this directive. The module MUST look up the compiled log format from nginx's log module at configuration time and use it to render each
log line at request time. The module MUST NOT define its own format language; all `$variable` expansion is handled by nginx's standard
log-format machinery, including `$ipng_source_tag` and `$server_addr`.
- **FR-8.3** Each worker MUST buffer log lines in a per-worker memory buffer before transmitting them as UDP datagrams. The buffer size
is controlled by the optional `buffer=<size>` parameter (default `64k`, minimum `1k`). The buffer MUST be flushed when it is full,
when the optional `flush=<duration>` timer fires (default `1s`, minimum `100ms`), or when the worker exits. This ensures that a
graceful `nginx -s reload` or a clean worker shutdown transmits all buffered log entries.
- **FR-8.4** The destination argument of `ipng_stats_logtail` MUST be a `udp://host:port` URI, where `host` is a literal IPv4 address
(no hostnames, no IPv6 in v0.1). Each buffer flush is transmitted as a single `sendto()` call on a per-worker `SOCK_DGRAM` socket
opened at worker init and closed at worker exit. If no receiver is listening on the target address and port, the kernel silently
discards the datagram — no error is returned, no disk I/O occurs, and the worker is never blocked. Lost datagrams when no receiver is
present are intentional; the UDP transport is designed for fire-and-forget analytics pipelines where delivery guarantees are
unnecessary and zero disk I/O is preferred over persistence. File-based access logging is not supported by this directive — operators
should use nginx's built-in `access_log` for that purpose.
- **FR-8.5** The directive MAY include an `if=$variable` parameter. When present, the logtail writer MUST evaluate the named nginx
variable at log phase and MUST suppress the log line if the variable is not found, is empty, or equals the string `"0"`. The
condition MUST be checked before the log format is rendered, so that filtered requests incur no formatting cost. This follows the same
semantics as nginx's built-in `access_log ... if=` and is intended for suppressing high-frequency requests (e.g. health checks) from
the logtail stream. Filtered requests MUST still be counted by the stats module — the `if=` condition affects only logtail output.
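
The `if=` rule in FR-8.5 reduces to a three-way check that runs before any formatting. A minimal sketch, with a hypothetical function name and a `NULL` pointer standing in for "variable not found":

```c
#include <stddef.h>

/* Returns 1 if the request should be written to the logtail stream,
 * 0 if it should be suppressed. `val` is the evaluated variable, or NULL
 * when the variable was not found. */
int ipng_logtail_if_passes(const char *val, size_t len)
{
    if (val == NULL || len == 0) {
        return 0;                  /* not found, or empty */
    }
    if (len == 1 && val[0] == '0') {
        return 0;                  /* the literal string "0" */
    }
    return 1;
}
```

Note that only the exact string `0` suppresses output; values such as `00` or `false` still pass, matching the semantics described above.
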
### Non-Functional Requirements
**NFR-1 Correctness under concurrency**
- **NFR-1.1** Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker,
any handler, or any timer other than the worker's own flush timer.
- **NFR-1.2** Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on
`memset`, `memcpy`, or any unaligned access for shared-zone updates.
- **NFR-1.3** A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see
partial flushes across different keys are acceptable, but a single counter MUST never appear to decrease.
- **NFR-1.4** Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes
`count > sum-of-buckets`. This is achieved by updating bucket counts before the sum/count and by a scraper that reads sum/count before
bucket counts: the buckets a scraper reads are then at least as fresh as the count it read first.
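
The split between plain increments on the request path (FR-4.1) and relaxed atomic adds at flush time (NFR-1.2) can be sketched as follows. This is an illustration using C11 `<stdatomic.h>`; a real nginx module would more likely use nginx's own atomic wrappers. The struct layout, the names, the fixed capacity, and the dirty-list bookkeeping are all assumptions, and the caller is assumed to pass keys below the capacity.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define IPNG_MAX_KEYS 64            /* illustrative capacity only */

struct ipng_shared_lane {           /* lives in the shared-memory zone */
    _Atomic uint64_t requests;
    _Atomic uint64_t bytes_in;
    _Atomic uint64_t bytes_out;
};

struct ipng_local_lane {            /* private to one worker */
    uint64_t requests, bytes_in, bytes_out;
    int      queued;                /* already on the dirty list? */
};

struct ipng_worker {
    struct ipng_local_lane lanes[IPNG_MAX_KEYS];
    size_t                 dirty[IPNG_MAX_KEYS];
    size_t                 ndirty;
};

/* Log phase: non-atomic increments only. */
void ipng_worker_count(struct ipng_worker *w, size_t key,
                       uint64_t in, uint64_t out)
{
    struct ipng_local_lane *l = &w->lanes[key];

    if (!l->queued) {
        l->queued = 1;
        w->dirty[w->ndirty++] = key;
    }
    l->requests++;
    l->bytes_in += in;
    l->bytes_out += out;
}

/* Flush tick: visits only dirty keys; returns how many were flushed. */
size_t ipng_worker_flush(struct ipng_worker *w, struct ipng_shared_lane *zone)
{
    size_t flushed = w->ndirty;

    for (size_t i = 0; i < w->ndirty; i++) {
        size_t key = w->dirty[i];
        struct ipng_local_lane *l = &w->lanes[key];

        atomic_fetch_add_explicit(&zone[key].requests, l->requests,
                                  memory_order_relaxed);
        atomic_fetch_add_explicit(&zone[key].bytes_in, l->bytes_in,
                                  memory_order_relaxed);
        atomic_fetch_add_explicit(&zone[key].bytes_out, l->bytes_out,
                                  memory_order_relaxed);
        l->requests = l->bytes_in = l->bytes_out = 0;
        l->queued = 0;
    }
    w->ndirty = 0;
    return flushed;
}
```

Because each shared lane only ever receives non-negative adds, a scraper reading a single lane sees a non-decreasing value, which is the property NFR-1.3 asks for.
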
**NFR-2 Hot-path cost**
- **NFR-2.1** The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref
(cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and a
`O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct
`(source, vip, class)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations,
where `K_total` is the number of distinct keys in the zone.
**NFR-3 Memory bounds**
- **NFR-3.1** The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When
the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn`
level no more than once per minute per worker.
- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT
accumulate private state that exceeds the shared-zone capacity.
- **NFR-3.3** Status codes are collapsed to six classes (`1xx`..`5xx`/`unknown`) at counter-update time (FR-2.6), bounding per-`(source,
vip)` counter cardinality at six lanes regardless of how many three-digit codes are observed. Any code outside `[100, 599]` falls
into `code="unknown"`. Per-code resolution is available via `ipng_stats_logtail` (FR-8), which operates off the hot path.
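
The class collapse in FR-2.6 and NFR-3.3 is a constant-time computation. A minimal sketch with hypothetical function names, using the index convention from the counter data model (`0` for `unknown`, `1..5` for `1xx`..`5xx`):

```c
/* Map a three-digit status code to its class index. Codes outside
 * [100, 599] fall into class 0 ("unknown"). */
unsigned ipng_status_class(unsigned status)
{
    return (status >= 100 && status <= 599) ? status / 100 : 0;
}

/* Label exported in the `code` dimension for a class index. */
const char *ipng_class_label(unsigned cls)
{
    static const char *const labels[6] = {
        "unknown", "1xx", "2xx", "3xx", "4xx", "5xx"
    };
    return labels[cls < 6 ? cls : 0];
}
```

For example, 204 and 206 both count under `code="2xx"`, while a non-standard 999 counts under `code="unknown"`.
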
**NFR-4 Reload neutrality**
- **NFR-4.1** `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this
transition; counters MUST NOT reset on reload.
- **NFR-4.2** New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables
lazily from observed traffic, and resume flushing.
- **NFR-4.3** The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator
renames a tag, new traffic MUST use the new tag.
- **NFR-4.4** When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be
evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under
defunct tags indefinitely. Rename is expected to be rare and evicting the old entries immediately is acceptable.
**NFR-5 Packaging robustness**
- **NFR-5.1** The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing
releases. CI MUST build one `.deb` per supported release and MUST fail if any target breaks.
- **NFR-5.2** The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`,
no `libjson*`.
- **NFR-5.3** A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (this is the
purpose of `--with-compat`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the
operator to upgrade the nginx package.
**NFR-6 Security**
- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` /
`IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's
`getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged.
- **NFR-6.2** Access to the scrape endpoint MUST be restricted with an `allow`/`deny` ACL placed in the operator's nginx config. The module
MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
controls.
- **NFR-6.3** The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It
logs only aggregate counters and its own operational events.
**NFR-7 Documentation**
- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
flowing, and integrating the scrape endpoint with Prometheus and other consumers. The user guide is the
document an operator reads once to get from a freshly-installed package to a working, observable deployment.
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know
where a given knob is allowed to appear.
## Architecture Overview
### Process Model
The project ships one dynamic nginx module:
- **`ngx_http_ipng_stats_module.so`** — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside
the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process
is launched.
There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx.
### Data Flow
Requests enter nginx through one of two listener classes:
1. **Device-tagged listeners** (`listen ... device=X ipng_source_tag=Y`) attribute every connection whose ingress interface is `X` to
the source string `Y`.
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) cover connections that arrive on an interface with no `device=`
entry. They are tagged with the configured default source (FR-1.3).
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
phase, the module's log-phase handler runs two independent responsibilities:
1. **Counter update**: increments the worker-local counter table keyed by `(source, vip, status_class)`.
2. **Logtail write**: if `ipng_stats_logtail` is configured (FR-8), renders the named `log_format` for this request and appends the
resulting line to the per-worker write buffer. The buffer is flushed as a UDP datagram on a timer, when full, or on worker exit
(FR-8.3, FR-8.4). This runs for every request regardless of `server` or `location` context.
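
The logtail buffering described in FR-8.3 can be sketched as an append-or-flush buffer. Everything here is an assumption made for illustration: the names, the callback that stands in for the per-worker `sendto()`, and the policy of flushing just before a line that would not fit.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the per-worker sendto(); receives one datagram payload. */
typedef void (*ipng_send_fn)(const char *data, size_t len, void *ctx);

struct ipng_logtail_buf {
    char         *data;
    size_t        cap;
    size_t        used;
    ipng_send_fn  send;
    void         *ctx;
};

/* Emit the buffered lines as one datagram; a no-op when empty. Called on
 * the flush timer, when the buffer fills, and at worker exit. */
void ipng_logtail_flush(struct ipng_logtail_buf *b)
{
    if (b->used > 0) {
        b->send(b->data, b->used, b->ctx);
        b->used = 0;
    }
}

/* Append one rendered log line. Returns -1 if the line can never fit. */
int ipng_logtail_append(struct ipng_logtail_buf *b, const char *line,
                        size_t len)
{
    if (len > b->cap) {
        return -1;
    }
    if (b->used + len > b->cap) {
        ipng_logtail_flush(b);
    }
    memcpy(b->data + b->used, line, len);
    b->used += len;
    return 0;
}

/* Example sender that only counts datagrams, useful for trying this out. */
void ipng_count_send(const char *data, size_t len, void *ctx)
{
    (void) data;
    (void) len;
    (*(unsigned *) ctx)++;
}
```

Keeping the send behind a callback means the buffering logic can be exercised without a socket; in a worker the callback would wrap the single `sendto()` per flush that FR-8.4 describes.
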
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
deltas to the shared-memory zone via atomic adds. The same timer triggers a logtail buffer flush if the flush duration has elapsed (FR-8.3).
The scrape handler, when invoked at `GET /.well-known/ipng/statsz` (or whatever location the operator chose), reads the shared-memory zone directly
and formats the output per the requested content type.
Scrape consumers fetch the endpoint at their configured cadence, optionally filtering via `?source_tag=<tag>` so that each consumer only
sees the traffic it delivered.
Aside from the logtail output (FR-8) — which sends UDP datagrams to a configured receiver — no component in this
project writes to anything outside nginx's own memory. The module does not emit log lines on the request path for the counter subsystem
and does not speak to any upstream.
## Components
### The nginx module
`ngx_http_ipng_stats_module` is the entire technical surface of this project. It is a single C module conforming to nginx's
dynamic-module ABI.
#### Responsibilities
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that
also caches the resolved ifindex (FR-1.1, FR-1.2).
- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection
carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
- Update EWMAs at flush time (FR-2.4).
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
- Honor `ipng_stats on|off` at any config level (FR-5.5).
#### Attribution Model
The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace
inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host;
the operator writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The module enables
`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted
connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex
up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means
outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives
via a GRE tunnel but the SYN-ACK must leave via the default route.
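
The cmsg read described above can be sketched for the IPv4 case as follows. This is a hedged illustration, not the module's code: the function names and the 256-byte buffer are assumptions, and the IPv6 path is omitted. It relies on the Linux behavior that `getsockopt(IP_PKTOPTIONS)` on an accepted TCP socket returns the stored control messages once `IP_PKTINFO` was enabled on the listener.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* exposes struct in_pktinfo on glibc */
#endif
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Walk a control-message buffer and return the ifindex carried by the
 * first IP_PKTINFO message, or -1 if there is none. */
int ipng_ifindex_from_pktoptions(void *buf, size_t len)
{
    struct msghdr   mh;
    struct cmsghdr *c;

    memset(&mh, 0, sizeof(mh));
    mh.msg_control = buf;
    mh.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&mh); c != NULL; c = CMSG_NXTHDR(&mh, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pi;
            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return pi.ipi_ifindex;
        }
    }
    return -1;
}

/* Fetch the ingress ifindex of an accepted IPv4 TCP connection. */
int ipng_ingress_ifindex_v4(int fd)
{
    union { char buf[256]; struct cmsghdr align; } u;
    socklen_t len = sizeof(u.buf);

    if (getsockopt(fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) != 0) {
        return -1;
    }
    return ipng_ifindex_from_pktoptions(u.buf, (size_t) len);
}
```

The returned ifindex is then the lookup key into the per-(device, family) table; a result of `-1` would map to the default source.
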
Because no listening socket is bound to a device, device-tagged and untagged `listen` directives on the same `address:port` share one
kernel socket (FR-1.5). The module tells them apart by ifindex at log time, and a connection whose ingress interface matches no
`device=` entry falls back to the default source (FR-1.3, FR-1.6).
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each interface.
Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a
new attributed interface does. This is the correct split: VIPs are vhost-level concerns and change often; interfaces are
infrastructure-level concerns and change rarely.
The design assumes interfaces used as `device=` sources carry **only** traffic from the expected source. Any other traffic arriving on
such an interface is silently misattributed to that interface's source tag. This is a deployment invariant, not a defect.
#### Counter Data Model
Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_class)` where
`source_id` and `vip_id` are small integers assigned at first observation and `status_class` is one of six values (`0=unknown`,
`1..5` for `1xx`..`5xx`). The value is a fixed-size record containing:
- `requests` (u64)
- `bytes_in` (u64)
- `bytes_out` (u64)
- `duration_sum_ms` (u64) — exported as `nginx_ipng_latency_total` (per class)
- `upstream_sum_ms` (u64)
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
- `upstream_hist` — same shape, only updated when an upstream served the request
- `bytes_in_hist`, `bytes_out_hist` — `Bb+1` u64 lanes over the byte-size bucket bounds
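As a rough picture of the key and record listed above, the layout might look like the sketch below. The type names, the bucket counts `IPNG_B` / `IPNG_BB`, and the integer widths of the IDs are assumptions made for illustration; only the field list and the six status classes come from the text.

```c
#include <stdint.h>

#define IPNG_B   8      /* assumed number of latency bucket bounds (B) */
#define IPNG_BB  6      /* assumed number of byte-size bucket bounds (Bb) */

enum { IPNG_CLASS_UNKNOWN = 0, IPNG_CLASS_COUNT = 6 };

/* Hash key: small integer IDs plus the collapsed status class. */
typedef struct {
    uint32_t  source_id;
    uint32_t  vip_id;
    uint8_t   status_class;     /* 0 = unknown, 1..5 = 1xx..5xx */
} ipng_key_t;

/* Fixed-size value: scalar counters followed by histogram lanes.
 * Each histogram has one lane per bound plus one for +Inf. */
typedef struct {
    uint64_t  requests;
    uint64_t  bytes_in;
    uint64_t  bytes_out;
    uint64_t  duration_sum_ms;
    uint64_t  upstream_sum_ms;
    uint64_t  duration_hist[IPNG_B + 1];
    uint64_t  upstream_hist[IPNG_B + 1];
    uint64_t  bytes_in_hist[IPNG_BB + 1];
    uint64_t  bytes_out_hist[IPNG_BB + 1];
} ipng_record_t;

/* Collapse a three-digit HTTP status into one of the six classes. */
unsigned
ipng_status_to_class(unsigned status)
{
    return (status >= 100 && status <= 599) ? status / 100
                                            : IPNG_CLASS_UNKNOWN;
}
```

Because every field is a `u64`, the record has no padding and its size is a simple function of the two bucket counts.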
Histogram lanes are kept per `(source, vip, class)` in storage, then summed across classes at scrape time to produce one
`_bucket`/`_sum`/`_count` series per `(source, vip)` — the Prometheus exposition never carries a `code` label on histogram series
(FR-3.7).
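The scrape-time summation across classes could look something like this. The sketch assumes lanes are stored non-cumulatively (each sample lands in exactly one lane) and are made cumulative only on output, which the text does not state; `hist_collapse`, `NCLASS`, and `NLANES` are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

#define NCLASS  6       /* unknown, 1xx..5xx */
#define NLANES  4       /* three finite bounds plus +Inf; illustrative */

/* lanes[class][lane] holds per-class, non-cumulative counts.  out[lane]
 * receives the cumulative count for that lane's `le` bound, summed over
 * all classes.  The return value is the total count (`_count`). */
uint64_t
hist_collapse(uint64_t lanes[NCLASS][NLANES], uint64_t out[NLANES])
{
    uint64_t  running = 0;
    size_t    b, c;

    for (b = 0; b < NLANES; b++) {
        for (c = 0; c < NCLASS; c++) {
            running += lanes[c][b];
        }
        out[b] = running;
    }

    return running;
}
```

With this shape the last lane always equals the returned total, which is the invariant Prometheus expects between the `+Inf` bucket and `_count`.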
A parallel table keyed by `(source_id, vip_id)`, with one row per `(source, VIP)` pair, holds the EWMAs for instantaneous rate. EWMAs are
floats but are updated only from the flush tick, so there is no float contention on the request path.
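One way to update such an EWMA from the flush tick is sketched below. The smoothing factor `alpha` and the names `ipng_ewma_t` / `ipng_ewma_tick` are assumptions; the document does not specify the smoothing constant or the exact formula.

```c
#include <stdint.h>

typedef struct {
    double  rate;       /* smoothed requests per second */
} ipng_ewma_t;

/* Fold one flush interval's request delta into the running rate.
 * alpha in (0, 1]: higher values track recent traffic more closely. */
void
ipng_ewma_tick(ipng_ewma_t *e, uint64_t delta_requests, double interval_s,
    double alpha)
{
    double  sample = (double) delta_requests / interval_s;

    e->rate += alpha * (sample - e->rate);
}
```

Since only the flush timer calls this, the request path never performs floating-point work on shared state.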
The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape
endpoint can recover the original strings without re-parsing configuration.
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of attributed interfaces (single digits). Because status codes are
collapsed to six classes (FR-2.6), the `status_class` dimension contributes at most 6× the `(source, vip)` count — a ~10× reduction
from the per-three-digit-code model considered and discarded.
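The capacity-bounded behaviour can be illustrated with a toy interning table. Everything here is a simplified stand-in: the real bound is the shared zone's size rather than a fixed slot count, and `intern_table_t`, `intern_id`, and the constants are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define INTERN_CAP     4    /* tiny on purpose; stands in for zone capacity */
#define INTERN_MAXLEN  64

typedef struct {
    char      names[INTERN_CAP][INTERN_MAXLEN];
    uint32_t  used;
    uint64_t  full_events;  /* would feed a "zone full" meta-metric */
} intern_table_t;

/* Return the ID for `name`, assigning the next free ID on first sight.
 * When the table is exhausted the key is dropped: the function returns
 * -1 and bumps full_events instead of allocating more memory. */
int
intern_id(intern_table_t *t, const char *name)
{
    uint32_t  i;

    for (i = 0; i < t->used; i++) {
        if (strcmp(t->names[i], name) == 0) {
            return (int) i;
        }
    }

    if (t->used == INTERN_CAP || strlen(name) >= INTERN_MAXLEN) {
        t->full_events++;
        return -1;
    }

    strcpy(t->names[t->used], name);
    return (int) t->used++;
}
```

Existing keys keep resolving after exhaustion; only new keys are refused, which matches the "existing keys continue updating" behaviour the design calls for.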
#### Hot Path
The log-phase handler is deliberately short. Pseudocode:
```c
static ngx_int_t
ipng_stats_log_handler(ngx_http_request_t *r)
{
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_time_t         *tp;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          class_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
    }

    /* lctx contains source_id and the cached VIP id,
     * or resolves the VIP lazily on the first seen address. */
    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);

    class_idx = ipng_status_to_class(r->headers_out.status);  /* 0..5 */
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, class_idx);

    counter->requests++;
    counter->bytes_in += r->request_length;
    counter->bytes_out += r->connection->sent;

    /* Assumes r->start_sec / r->start_msec hold the wall-clock start time,
     * as used by stock nginx for $request_time; clamp against clock steps. */
    tp = ngx_timeofday();
    elapsed_ms = (ngx_msec_int_t)
                 ((tp->sec - r->start_sec) * 1000 + (tp->msec - r->start_msec));
    elapsed_ms = ngx_max(elapsed_ms, 0);

    ipng_hist_add(&counter->duration_hist, elapsed_ms);
    counter->duration_sum_ms += elapsed_ms;

    /* Upstream lanes are only touched when an upstream served the request. */
    if (r->upstream_states && r->upstream_states->nelts > 0) {
        ngx_msec_int_t  up_ms = ipng_upstream_total_ms(r);

        ipng_hist_add(&counter->upstream_hist, up_ms);
        counter->upstream_sum_ms += up_ms;
    }

    return NGX_OK;
}
```
Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by
`(source_id, vip_id, class_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.
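A per-worker slot lookup of this kind could be as simple as a small open-addressing table. The sketch below is an assumption about shape, not the module's code: the table size, the hash mix, and the names `slot_t` / `worker_slot` are all illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS  64       /* must be a power of two; illustrative size */

typedef struct {
    uint32_t  source_id;
    uint32_t  vip_id;
    uint8_t   class_idx;
    uint8_t   in_use;
    uint64_t  requests;
} slot_t;

/* Find or claim the private slot for (source_id, vip_id, class_idx)
 * using linear probing.  Returns NULL only when the table is full. */
slot_t *
worker_slot(slot_t *table, uint32_t source_id, uint32_t vip_id,
    uint8_t class_idx)
{
    uint32_t  h, i;
    slot_t   *s;

    h = (source_id * 2654435761u) ^ (vip_id * 40503u) ^ class_idx;

    for (i = 0; i < SLOTS; i++) {
        s = &table[(h + i) & (SLOTS - 1)];

        if (!s->in_use) {
            s->source_id = source_id;
            s->vip_id = vip_id;
            s->class_idx = class_idx;
            s->in_use = 1;
            return s;
        }

        if (s->source_id == source_id && s->vip_id == vip_id
            && s->class_idx == class_idx)
        {
            return s;
        }
    }

    return NULL;
}
```

The table lives in worker-private memory, so neither the probe nor the subsequent counter increments need atomics or locks.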
#### Flush Timer
At the interval configured by `ipng_stats_flush_interval` (default 1s), the worker:
1. Iterates its dirty-slot list (slots touched since the previous flush).
2. For each dirty slot, computes the delta versus the last flushed snapshot stored in the same slot.
3. Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane.
4. Updates EWMAs from the delta.
5. Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again).
The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing.
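The delta-and-`fetch_add` step can be sketched with C11 atomics. This is a single-lane illustration under stated assumptions: the dirty list is reduced to a per-slot flag, only the `requests` lane is shown, and `worker_slot_t`, `shared_slot_t`, and `flush_slot` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint64_t  requests;             /* private running total */
    uint64_t  flushed_requests;     /* value as of the last flush */
    int       dirty;                /* touched since the last flush */
} worker_slot_t;

typedef struct {
    _Atomic uint64_t  requests;     /* lane in the shared zone */
} shared_slot_t;

/* Publish the growth since the previous flush and return it so the
 * caller can feed the same delta into the EWMA update. */
uint64_t
flush_slot(worker_slot_t *w, shared_slot_t *s)
{
    uint64_t  delta = w->requests - w->flushed_requests;

    if (delta != 0) {
        atomic_fetch_add_explicit(&s->requests, delta,
                                  memory_order_relaxed);
    }

    /* Keep the slot contents; only the snapshot and dirty mark move. */
    w->flushed_requests = w->requests;
    w->dirty = 0;

    return delta;
}
```

Relaxed ordering is enough here because each lane is an independent monotonic counter and readers do not need a consistent cross-lane snapshot.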
#### Scrape Handler
The `ipng_stats` handler is a leaf content handler. It:
1. Parses `?source_tag=` and `?vip=` into exact-match filters.
2. Parses `Accept:` to pick output format.
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
briefly).
4. Emits each matching key in the chosen format directly into an nginx chain buffer.
Output buffering and sending are standard nginx content handler code. The handler does not allocate during the walk; it uses a
fixed-size buffer per chain link and requests new links only when full.
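The filter and emit steps might look roughly like the following. The metric name `nginx_ipng_requests_total` is an assumption (the document only fixes the `nginx_ipng_` prefix and the label names), and `key_matches` / `emit_requests_line` are hypothetical helpers operating on a plain buffer rather than a real nginx chain link.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Exact-match filters; a NULL filter means "not specified". */
int
key_matches(const char *want_source, const char *want_vip,
    const char *source, const char *vip)
{
    if (want_source != NULL && strcmp(want_source, source) != 0) {
        return 0;
    }
    if (want_vip != NULL && strcmp(want_vip, vip) != 0) {
        return 0;
    }
    return 1;
}

/* Format one counter sample into buf.  Returns the number of bytes
 * written, or 0 if the line does not fit, in which case the caller
 * would request a fresh chain link and retry. */
size_t
emit_requests_line(char *buf, size_t cap, const char *source,
    const char *vip, const char *code, uint64_t value)
{
    int  n;

    n = snprintf(buf, cap,
                 "nginx_ipng_requests_total"
                 "{source_tag=\"%s\",vip=\"%s\",code=\"%s\"} %llu\n",
                 source, vip, code, (unsigned long long) value);

    return (n < 0 || (size_t) n >= cap) ? 0 : (size_t) n;
}
```

Returning 0 on overflow keeps the walk allocation-free: the only allocation point is the moment a link fills up.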
#### Presents and Consumes
**Presents.**
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
query parameters.
- **Two new `listen` parameters**, `device=` and `ipng_source_tag=`, usable anywhere a `listen` directive is used.
- **Six new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
`ipng_stats_buckets`, `ipng_stats_byte_buckets`, `ipng_stats` (on/off).
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source_tag`, `vip`, and (for counter metrics) a `code` class label
(`1xx`..`5xx`/`unknown`).
**Consumes.**
- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection
`getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd.
- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.
### The Debian package
`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula;
Debian is the target and upstream nginx on Debian is the platform.
#### Responsibilities
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-7.3).
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
(FR-7.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-7.4).
#### Build
The build is a plain `debian/rules` invocation that:
1. Fetches the nginx source for the installed `nginx-dev` version.
2. Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree.
3. Builds only the module (`make modules`).
4. Installs the resulting `.so` into the package tree.
No nginx binary is produced, shipped, or touched. The package is strictly additive.
#### Presents and Consumes
**Presents.**
- **One Debian package** per supported release.
- **One dynamic module** loadable into stock upstream nginx.
**Consumes.**
- **The target release's `nginx-dev` package** at build time.
- **The running `nginx` package** at install time, for `nginx -t` validation.
## Operational Concerns
### Deployment Topology
A typical deployment on a single nginx host looks like:
- One interface per traffic source that should be separately attributed (e.g. GRE tunnels, VLANs), set up by the operator's networking
layer (systemd-networkd, Netplan, or a hand-rolled interface config). Interface names follow a consistent pattern, typically
`gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, interface)` pair, reused across vhosts.
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
- A single scrape location, e.g. `location = /.well-known/ipng/statsz`, served from a locked-down server block that only allows scrape consumers and
the local Prometheus scraper.
### Configuration
A minimal working configuration is about twenty lines:
```nginx
load_module modules/ngx_http_ipng_stats_module.so;
http {
    ipng_stats_zone ipng:4m;
    server {
        listen 80;
        listen [::]:80;
        include /etc/nginx/ipng-stats/listens.conf;
        server_name _;
        # ... normal vhost content
    }
    server {
        listen 127.0.0.1:9113;
        location = /.well-known/ipng/statsz {
            ipng_stats;
            allow 127.0.0.1;
            allow 2001:db8::/48;   # scrape consumers
            deny all;
        }
    }
}
```
`listens.conf` holds two lines per attributed interface (one per address family) and is stable across vhost changes.
### Nginx Reload Semantics
`nginx -s reload` forks fresh workers, has old workers finish in-flight requests, and then shuts the old workers down. The plugin's
shared-memory zone is declared by name, which survives the reload; new workers attach to the same zone and continue accumulating
counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).
Source tags are recomputed from the new configuration on reload (NFR-4.3). Renaming a tag in configuration means new traffic appears
under the new name; the old name lingers in the zone until either a full nginx restart recreates the zone or an LRU eviction policy
ages it out (this is one of the open questions below).
### Observability of the Plugin Itself
The plugin emits a handful of meta-metrics on the same scrape endpoint:
- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity.
- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full.
- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run.
- `nginx_ipng_flush_duration_seconds` — histogram of flush durations.
- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations.
These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against
some other endpoint.
### Failure Modes
- **Shared zone full.** New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected
to resize the zone. Existing keys continue updating normally (NFR-3.1).
- **Worker crash.** The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the
default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an
observability plane.
- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
restarts.
- **Device disappears.** If an operator removes an interface without removing its `listen` line, nginx's bind will fail on the next
reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, an interface is
probably missing from the `listen` include.
- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
- **Scrape consumer is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
When the consumer comes back, it resumes fetching. No state is lost.
### Security
- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the
listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1).
- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
this is deliberate, and documented (NFR-6.2).
- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
Access-log-style observation belongs in nginx's own access log (NFR-6.3).
- **Zone sizing as a soft DoS mitigation.** Because new keys are dropped when the zone is full rather than allocating unbounded memory,
a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone
exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics.
## Alternatives Considered
- **OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`.** Rejected. Adds a large runtime dependency just for a narrow feature. The
deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of
packaging.
- **Access log tailing sidecar.** Rejected. Decoupled but introduces a second deploy unit, a log-rotation race, and a synchronization
gap between access log truncation and counter accuracy. Also loses live EWMAs.
- **`nginx-module-vts`.** Considered and rejected. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress
interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a
bigger diff than writing a purpose-built module.
- **Attribution via CONNMARK on a single shared GRE tunnel.** Rejected after investigation. Netfilter loses the outer GRE source during
decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
heavy for a log-phase handler.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still
requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves
the same problem with plain socket options nginx already knows about.
- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production:
`SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening
socket before `accept()` returns — which there is no hook to override in userspace. On DSR / maglev setups (SYN via a GRE
tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink,
so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost.
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
socket-lookup specificity.
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works
because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
address. This makes the generated include file size independent of the VIP count.
- **Pushing counters to an external daemon over gRPC.** Rejected. It complicates restart neutrality and adds a gRPC client dependency to
a C module. Pull-based scrape is simpler: consumers fetch when they want, and the module has no outbound connections.
- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
both audiences from one ACL.
## Decisions Deferred Post-v0.1
- **Histogram bucket overrides per `source` or per `vip`.** v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves
both latency-sensitive (API) and bulk (download) traffic, so that no single bucket set fits both well, making buckets
per-`source` or per-`vip` is possible but multiplies memory and complicates Prometheus output.
- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For
TLS-terminating frontends, the time spent in the TLS handshake is therefore invisible. Adding a `tls_handshake_duration` histogram is
deferred until operators ask for it.
- **Consumer fetch cadence.** Whichever cadence a consumer adopts for traffic counters — a one-second refresh, a longer Prometheus
scrape interval, or an SSE bridge layered on top — the plugin supports it. The choice is on the consumer side.