Add ngx_http_ipng_stats_module: per-VIP, per-device traffic counters

Full implementation of the nginx dynamic module with:
- SO_BINDTODEVICE-based per-interface traffic attribution
- Per-worker lock-free counters flushed to shared memory
- Prometheus text and JSON scrape endpoint at configurable location
- UDP-only global logtail (ipng_stats_logtail) for fire-and-forget
  access log streaming
- $ipng_source_tag nginx variable for use in log_format/map
- Histogram buckets, EWMA rate gauges, zone meta-metrics
- Debian packaging (libnginx-mod-http-ipng-stats)
- Robot Framework end-to-end tests via containerlab
- SPDX Apache-2.0 headers on all source files
2026-04-16 17:36:42 +02:00
parent c05bcf6aa6
commit 5a7e2f77f1
25 changed files with 4016 additions and 102 deletions


@@ -1,4 +1,5 @@
# nginx-vpp-maglev-plugin Design Document
<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-ipng-stats-plugin Design Document
## Metadata
@@ -7,7 +8,7 @@
| **Status** | Draft — describes intended behavior for `v0.1.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-16 |
| **Audience** | Operators and contributors building the nginx-side observability half of `vpp-maglev` |
| **Audience** | Operators and contributors deploying per-device, per-VIP traffic observability on nginx |
The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
@@ -16,60 +17,52 @@ lowercase — "can", "will", "does" — and should not be read as normative.
## Summary
`nginx-vpp-maglev-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records
per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific `vpp-maglev` instance whose GRE
tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that
`maglevd-frontend`, Prometheus, and ad-hoc `curl` sessions can all read the same data. The module is the nginx-side answer to the open
question in [`vpp-maglev/docs/design.md`](../../vpp-maglev/docs/design.md) about per-backend traffic counters: VPP's `lb` plugin bypasses
the FIB and cannot produce them, so the backends report what they see.
`nginx-ipng-stats-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module
records per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific interface on which each
connection arrived. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that Prometheus, custom
dashboards, and ad-hoc `curl` sessions can all read the same data.
## Background
`vpp-maglev` programs VPP's `lb` plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the
deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more `maglevd` daemons,
decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client.
Any deployment where traffic arrives on distinct Linux interfaces — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet — can
benefit from per-interface traffic visibility. The nginx instances that serve the traffic already observe everything an operator wants to
see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small in-process
module emits those numbers on an HTTP endpoint, and consumers scrape the data filtered by source tag.
The design document for `vpp-maglev` identifies **per-AS traffic counters** as an explicit open question: VPP's `lb` fast path bypasses
the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the `maglevd-frontend`
status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of
it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting.
This project closes that gap from the opposite end. The nginx instances that serve the traffic already observe everything an operator
wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small
in-process module emits those numbers on an HTTP endpoint, and `maglevd-frontend` fans out to the backends of each frontend and aggregates
the result into the existing status page.
One motivating use case is [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), where each load-balancer instance terminates a GRE
tunnel on the nginx host. The module attributes traffic per tunnel, letting the frontend show per-backend counters that VPP's fast path
cannot provide. But the module is not coupled to that use case — it works with any interface type and any consumer.
## Goals and Non-Goals
### Product Goals
1. **Per-VIP, per-maglev traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and out,
and request-duration histograms, split by which `maglevd` instance delivered the traffic.
1. **Per-VIP, per-device traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and
out, and request-duration histograms, split by which interface delivered the traffic.
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
using only the `maglevd-frontend` UI can both consume the module without extra configuration.
using a custom consumer can both consume the module without extra configuration.
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
stock upstream nginx without recompiling nginx itself.
5. **Composable with normal nginx use.** A host running the module as a maglev backend **and** serving unrelated direct web traffic on the
same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new
parameters and directives that are no-ops when unused.
5. **Composable with normal nginx use.** A host running the module with device-bound listeners **and** serving unrelated direct web
traffic on the same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive;
it only adds new parameters and directives that are no-ops when unused.
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
of view.
### Non-Goals
- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
`nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page.
`nginx-lua-prometheus`. Its metric set is deliberately narrow: per-VIP, per-device counters, histograms, and rate gauges.
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
- The module does **not** talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not
know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive;
nothing more.
- The module does **not** talk to any external daemon. It does not initiate gRPC or read any external configuration. The attribution tag
it emits is a string supplied by the operator in the `listen` directive; nothing more.
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
access logs and existing log-analysis tools.
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
restart they start at zero. Consumers who need historical retention SHOULD scrape it from Prometheus.
- The module does **not** own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and
- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and
nginx master privileges are the operator's responsibility.
## Requirements
@@ -83,11 +76,11 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
plain listening socket as stock nginx does.
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `source=<tag>`, which attaches a short string tag to
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- **FR-1.3** A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see
- **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
`ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`.
- **FR-1.4** A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`.
- **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
userspace.
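The FR-1 rules above can be sketched as a `listen` include fragment. Interface names and tags here are illustrative, not prescribed by the module; only the `device=` and `ipng_source_tag=` parameters and the `direct` default come from the requirements.

```nginx
# Hypothetical listens.conf fragment, one (family, interface) pair per line.
listen 80 device=gre-mg1 ipng_source_tag=mg1;      # FR-1.1/FR-1.2: bound and tagged
listen [::]:80 device=gre-mg1 ipng_source_tag=mg1;
listen 80 device=vlan100;                          # FR-1.4: tag defaults to "vlan100"
listen [::]:80 device=vlan100;
```

In server blocks that also serve direct traffic, the wildcard fallback pair (`listen 80;` and `listen [::]:80;`) coexists with these and is tagged `direct` per FR-1.3; the kernel's socket lookup dispatches each SYN to the most specific listener (FR-1.5).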
@@ -122,14 +115,14 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
- `Accept: application/json` → JSON output.
- `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
- **FR-3.3** The handler MUST support a `source=<tag>` query parameter that filters the output to only counters whose source dimension
- **FR-3.3** The handler MUST support a `source_tag=<tag>` query parameter that filters the output to only counters whose source dimension
equals the supplied tag. The comparison is exact-match and case-sensitive.
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST version via a top-level `schema` field so that breaking
changes can be made additively without bricking existing consumers.
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source`
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source_tag`
and `vip`. Counter metrics additionally carry a `code` label.
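A minimal sketch of the FR-3 handler behavior, in Python for illustration only: the data model, function name, and the metric name `nginx_ipng_requests_total` are assumptions; only the `nginx_ipng_` prefix, the `source_tag`/`vip`/`code` labels, the exact-match filters, and the content negotiation rule come from the requirements.

```python
import json

COUNTERS = {
    # (source_tag, vip, status_code) -> request count
    ("mg1", "192.0.2.10", 200): 1042,
    ("mg1", "192.0.2.10", 404): 7,
    ("direct", "192.0.2.11", 200): 55,
}

def render(accept_header, source_tag=None, vip=None):
    # FR-3.3/FR-3.4/FR-3.5: exact-match, case-sensitive filters; intersection.
    rows = [(k, v) for k, v in sorted(COUNTERS.items())
            if (source_tag is None or k[0] == source_tag)
            and (vip is None or k[1] == vip)]
    # FR-3.2: only an explicit application/json Accept yields JSON.
    if accept_header == "application/json":
        return json.dumps({"schema": 1, "requests": [  # FR-3.6: versioned schema
            {"source_tag": s, "vip": a, "code": c, "count": n}
            for (s, a, c), n in rows]})
    # Anything else, including an absent Accept header, yields Prometheus text.
    lines = ["# TYPE nginx_ipng_requests_total counter"]
    for (s, a, c), n in rows:
        lines.append(f'nginx_ipng_requests_total{{source_tag="{s}",vip="{a}",code="{c}"}} {n}')
    return "\n".join(lines) + "\n"
```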
**FR-4 Hot path and flush**
@@ -148,28 +141,60 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
nor `source=`. Default `direct`.
nor `ipng_source_tag=`. Default `direct`.
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
strictly increasing positive integers.
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
`http` level when the module is loaded. Requests served by a location running the `ipng_stats` handler MUST NOT themselves be counted
(the module automatically sets `off` for the scrape location).
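The FR-5.4 bucket rules can be sketched as follows. This is an illustrative Python model, not the module's C implementation; the boundary values and function names are assumptions, and the lookup uses the cumulative-histogram convention of "first boundary ≥ value", with values past the last boundary landing in an implicit +Inf bucket.

```python
from bisect import bisect_left

def validate_buckets(bounds):
    # FR-5.4: values MUST be strictly increasing positive integers (ms).
    if not bounds or bounds[0] <= 0 or any(b >= a for b, a in zip(bounds, bounds[1:])):
        raise ValueError("ipng_stats_buckets: strictly increasing positive integers required")
    return bounds

def bucket_index(bounds, duration_ms):
    # Index of the first boundary >= duration; len(bounds) means +Inf.
    return bisect_left(bounds, duration_ms)

BOUNDS = validate_buckets([5, 10, 25, 50, 100, 250, 500, 1000])
```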
**FR-6 Packaging**
**FR-6 Variables**
- **FR-6.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
- **FR-6.1** The module MUST register an nginx variable `$ipng_source_tag` that resolves to the source tag of the listening socket that
accepted the current connection. For device-bound listeners this is the `ipng_source_tag=` value (or the `device=` name if
`ipng_source_tag=` was not set); for wildcard fallback listeners this is the value of `ipng_stats_default_source`. The variable is
usable in `log_format`, `map`, `add_header`, `if`, and any other nginx context that accepts variables.
- **FR-6.2** `$ipng_source_tag` MUST be available unconditionally when the module is loaded, even if `ipng_stats_zone` is not
configured. It does not depend on the counter subsystem; it only depends on the listen-parameter parsing. Operators who need the VIP
address should use nginx's built-in `$server_addr` variable.
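FR-6.1's contexts can be illustrated with a small `http`-level fragment. The format and map names here are hypothetical; `$ipng_source_tag` and the built-in variables are the only assumed names from the document.

```nginx
# Hypothetical http{} fragment using $ipng_source_tag (FR-6.1, FR-6.2).
log_format tagged '$remote_addr $server_addr $ipng_source_tag '
                  '"$request" $status $bytes_sent';

map $ipng_source_tag $is_direct {
    default 0;
    direct  1;   # wildcard fallback listeners (FR-1.3)
}
```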
**FR-7 Packaging**
- **FR-7.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
nginx itself.
- **FR-6.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
- **FR-7.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
convention used by existing third-party nginx modules packaged for Debian.
- **FR-6.3** The package MUST install:
- **FR-7.3** The package MUST install:
- `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
- `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
- A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
package's postinst.
- **FR-6.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
- **FR-7.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
`modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
**FR-8 Logtail**
- **FR-8.1** The module MUST support an `ipng_stats_logtail <format_name> udp://host:port [buffer=<size>] [flush=<duration>]` directive
at the `http` level that registers a global log-phase writer which fires unconditionally for every request, regardless of which
`server` or `location` block handled it. One directive at the `http` level is sufficient to cover all vhosts — operators MUST NOT be
required to repeat an `access_log` directive in every `server` block to achieve a single global access log.
- **FR-8.2** The `<format_name>` argument MUST be the name of an existing nginx `log_format` defined in the same `http` block before
this directive. The module MUST look up the compiled log format from nginx's log module at configuration time and use it to render each
log line at request time. The module MUST NOT define its own format language; all `$variable` expansion is handled by nginx's standard
log-format machinery, including `$ipng_source_tag` and `$server_addr`.
- **FR-8.3** Each worker MUST buffer log lines in a per-worker memory buffer before transmitting them as UDP datagrams. The buffer size
is controlled by the optional `buffer=<size>` parameter (default `64k`, minimum `1k`). The buffer MUST be flushed when it is full,
when the optional `flush=<duration>` timer fires (default `1s`, minimum `100ms`), or when the worker exits. This ensures that a
graceful `nginx -s reload` or a clean worker shutdown transmits all buffered log entries.
- **FR-8.4** The destination argument of `ipng_stats_logtail` MUST be a `udp://host:port` URI, where `host` is a literal IPv4 address
(no hostnames, no IPv6 in v0.1). Each buffer flush is transmitted as a single `sendto()` call on a per-worker `SOCK_DGRAM` socket
opened at worker init and closed at worker exit. If no receiver is listening on the target address and port, the kernel silently
discards the datagram — no error is returned, no disk I/O occurs, and the worker is never blocked. Lost datagrams when no receiver is
present are intentional; the UDP transport is designed for fire-and-forget analytics pipelines where delivery guarantees are
unnecessary and zero disk I/O is preferred over persistence. File-based access logging is not supported by this directive — operators
should use nginx's built-in `access_log` for that purpose.
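The FR-8.3/FR-8.4 buffer-and-flush behavior can be sketched in Python. The class and the port number are illustrative; the point is that each flush is a single `sendto()` on an unconnected `SOCK_DGRAM` socket, so a missing receiver causes no error and no blocking.

```python
import socket

class Logtail:
    def __init__(self, host, port, buffer_size=64 * 1024):
        self.dest = (host, port)
        self.buffer_size = buffer_size
        self.buf = bytearray()
        # Opened once (at worker init in the real module, FR-8.4).
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def write(self, line):
        self.buf += line.encode() + b"\n"
        if len(self.buf) >= self.buffer_size:
            self.flush()              # flush when full (FR-8.3) ...

    def flush(self):                  # ... or on the timer / at worker exit
        if self.buf:
            self.sock.sendto(bytes(self.buf), self.dest)  # one datagram per flush
            self.buf.clear()

tail = Logtail("127.0.0.1", 9999, buffer_size=64)
tail.write('192.0.2.1 192.0.2.10 mg1 "GET / HTTP/1.1" 200 512')
tail.flush()  # no receiver bound on 9999: the kernel drops it, no error raised
```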
### Non-Functional Requirements
**NFR-1 Correctness under concurrency**
@@ -242,7 +267,7 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the
flowing, and integrating the scrape endpoint with Prometheus and other consumers. The user guide is the
document an operator reads once to get from a freshly-installed package to a working, observable deployment.
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
@@ -265,26 +290,31 @@ There is no daemon, no socket the module listens on, no control plane. Everythin
Requests enter nginx through one of two listener classes:
1. **Device-bound listeners** (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged
1. **Device-bound listeners** (`listen ... device=X ipng_source_tag=Y`) accept only connections whose ingress interface is `X`. Each is tagged
with a source string `Y`.
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They
are tagged with the configured default source (FR-1.3).
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`.
phase, the module's log-phase handler runs two independent responsibilities:
1. **Counter update** — increments the worker-local counter table keyed by `(source, vip, status_code)`.
2. **Logtail write** — if `ipng_stats_logtail` is configured (FR-8), renders the named `log_format` for this request and appends the
resulting line to the per-worker write buffer. The buffer is flushed as a UDP datagram on a timer, when full, or on worker exit
(FR-8.3, FR-8.4). This runs for every request regardless of `server` or `location` context.
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
deltas to the shared-memory zone via atomic adds.
deltas to the shared-memory zone via atomic adds. The same timer triggers a logtail buffer flush if the flush duration has elapsed (FR-8.3).
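The hot-path/flush split described above can be modeled in a few lines of Python. The dict and set names are illustrative; the real module applies atomic adds to shared memory, which a plain dict stands in for here.

```python
shared_zone = {}        # stand-in for the shared-memory counter zone

class Worker:
    def __init__(self):
        self.local = {}     # worker-private counter table
        self.dirty = set()  # keys touched since the last flush

    def on_request(self, key, n=1):
        # Hot path: one plain increment, no locks, no system calls.
        self.local[key] = self.local.get(key, 0) + n
        self.dirty.add(key)

    def flush(self):
        # Timer path (FR-4.2): walk only dirty keys, so idle VIPs cost nothing.
        for key in self.dirty:
            shared_zone[key] = shared_zone.get(key, 0) + self.local[key]
            self.local[key] = 0   # delta consumed; the key stays interned
        self.dirty.clear()

w = Worker()
for _ in range(3):
    w.on_request(("mg1", "192.0.2.10", 200))
w.flush()
```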
The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly
The scrape handler, when invoked at `GET /.well-known/ipng/statsz` (or whatever location the operator chose), reads the shared-memory zone directly
and formats the output per the requested content type.
`maglevd-frontend` fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for
maglevd state. It filters server-side via `?source=<its own tag>` so that it only sees the traffic it delivered. The aggregated view is
rendered alongside the existing maglev status page.
Scrape consumers fetch the endpoint at their configured cadence, optionally filtering via `?source_tag=<tag>` so that each consumer sees
only the traffic attributed to the sources it cares about.
No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system,
does not emit log lines on the request path, and does not speak to any upstream.
Aside from the logtail output (FR-8) — which sends UDP datagrams to a configured receiver — no component in this
project writes to anything outside nginx's own memory. The module does not emit log lines on the request path for the counter subsystem
and does not speak to any upstream.
## Components
@@ -295,7 +325,7 @@ dynamic-module ABI.
#### Responsibilities
- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
@@ -305,22 +335,22 @@ dynamic-module ABI.
#### Attribution Model
The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
inspection. Each `maglevd` instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one
`listen ... device=<ifname> source=<tag>` line per `(family, tunnel)` pair. The kernel binds that listening socket with `SO_BINDTODEVICE`,
which causes it to match only connections whose ingress interface is that tunnel. A wildcard `listen 80;` and `listen [::]:80;` pair
provides the fallback for traffic arriving on any other interface — typically normal web traffic, not from maglev.
The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator
writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The kernel binds that listening socket
with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and
`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic.
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel.
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each interface.
Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a
new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances
are fleet-level concerns and change rarely.
new attributed interface does. This is the correct split: VIPs are vhost-level concerns and change often; interfaces are
infrastructure-level concerns and change rarely.
The design assumes GRE tunnels used as `device=` sources carry **only** maglev-originated traffic. Any other traffic arriving on such an
interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect.
The design assumes interfaces used as `device=` sources carry **only** traffic from the expected source. Any other traffic arriving on
such an interface is silently misattributed to that interface's source tag. This is a deployment invariant, not a defect.
#### Counter Data Model
@@ -344,7 +374,7 @@ endpoint can recover the original strings without re-parsing configuration.
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is
low hundreds), and the number of distinct source tags is the number of attributed interfaces (single digits). The dominant factor is
`status_code`; ~60 keys per VIP is a typical steady state.
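A capacity-bounded interning table of the kind described can be sketched as follows. The class name and capacity value are illustrative; the behavior shown — stable ids, no eviction, a drop counter once full — is what NFR-3.1 requires.

```python
class InternTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.ids = {}          # string -> small integer id
        self.strings = []      # id -> string, so the scrape endpoint can
                               # recover originals without re-parsing config
        self.dropped = 0

    def intern(self, s):
        if s in self.ids:
            return self.ids[s]
        if len(self.strings) >= self.capacity:
            self.dropped += 1  # zone full: drop the new key, bump the counter
            return None
        self.ids[s] = len(self.strings)
        self.strings.append(s)
        return self.ids[s]

tags = InternTable(capacity=2)
a = tags.intern("mg1")
b = tags.intern("mg2")
c = tags.intern("mg3")   # over capacity: dropped, counted
```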
#### Hot Path
@@ -408,7 +438,7 @@ The worker never walks the entire table — only dirty slots — so idle VIPs co
The `ipng_stats` handler is a leaf content handler. It:
1. Parses `?source=` and `?vip=` into exact-match filters.
1. Parses `?source_tag=` and `?vip=` into exact-match filters.
2. Parses `Accept:` to pick output format.
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
briefly).
@@ -423,10 +453,10 @@ fixed-size buffer per chain link and requests new links only when full.
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
query parameters.
- **Two new `listen` parameters**, `device=` and `source=`, usable anywhere a `listen` directive is used.
- **Two new `listen` parameters**, `device=` and `ipng_source_tag=`, usable anywhere a `listen` directive is used.
- **Six new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
`ipng_stats_buckets`, `ipng_stats_logtail`, `ipng_stats` (on/off).
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`.
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source_tag`, `vip`, and (for request counters) `code`.
**Consumes.**
@@ -442,10 +472,10 @@ Debian is the target and upstream nginx on Debian is the platform.
#### Responsibilities
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-7.3).
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
(FR-6.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4).
(FR-7.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-7.4).
#### Build
@@ -476,12 +506,13 @@ No nginx binary is produced, shipped, or touched. The package is strictly additi
A typical deployment on a single nginx host looks like:
- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a
hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
- One interface per traffic source that should be separately attributed (e.g. GRE tunnels, VLANs), set up by the operator's networking
layer (systemd-networkd, Netplan, or a hand-rolled interface config). Interface names follow a consistent pattern, typically
`gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, interface)` pair, reused across vhosts.
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and
- A single scrape location, e.g. `location = /.well-known/ipng/statsz`, served from a locked-down server block that only allows scrape consumers and
the local Prometheus scraper.
### Configuration
@@ -497,7 +528,7 @@ http {
server {
listen 80;
listen [::]:80;
include /etc/nginx/ipng-maglev/listens.conf;
include /etc/nginx/ipng-stats/listens.conf;
server_name _;
# ... normal vhost content
@@ -505,17 +536,17 @@ http {
server {
listen 127.0.0.1:9113;
location = /ipng-stats {
location = /.well-known/ipng/statsz {
ipng_stats;
allow 127.0.0.1;
allow 2001:db8::/48; # maglev fleet
allow 2001:db8::/48; # scrape consumers
deny all;
}
}
}
```
`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
`listens.conf` is two lines per attributed interface (two address families each) and stable across vhost changes.
### Nginx Reload Semantics
@@ -550,15 +581,15 @@ some other endpoint.
- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
restarts.
- **Device disappears.** If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next
- **Device disappears.** If an operator removes an interface without removing its `listen` line, nginx's bind will fail on the next
reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance
is probably missing from the `listen` include.
default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, an interface is
probably missing from the `listen` include.
- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
- **Maglev frontend is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
When the frontend comes back, it resumes fetching. No state is lost.
- **Scrape consumer is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
When the consumer comes back, it resumes fetching. No state is lost.
### Security
@@ -586,18 +617,16 @@ some other endpoint.
decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
heavy for a log-phase handler.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source
tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
kernel primitives nginx already knows about.
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
socket-lookup specificity.
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works
because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
address. This makes the generated include file size independent of the VIP count.
- **Pushing counters from the module into `maglevd` over gRPC.** Rejected. It inverts the wait-for graph (maglevd's design doc is
careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C
module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use
its existing per-server goroutine model.
- **Pushing counters to an external daemon over gRPC.** Rejected. It complicates restart neutrality and adds a gRPC client dependency to
a C module. Pull-based scrape is simpler: consumers fetch when they want, and the module has no outbound connections.
- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
both audiences from one ACL.
@@ -609,5 +638,5 @@ some other endpoint.
- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For
TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until
operators ask for it.
- **`maglevd-frontend` fetch cadence.** Whichever cadence the frontend adopts for traffic counters — the existing ~one-second refresh,
or an SSE bridge layered on top — the plugin supports it. The choice is on the frontend side.
- **Consumer fetch cadence.** Whichever cadence a consumer adopts for traffic counters — a one-second refresh, a longer Prometheus
scrape interval, or an SSE bridge layered on top — the plugin supports it. The choice is on the consumer side.