nginx-vpp-maglev-plugin Design Document
Metadata
| Status | Draft — describes intended behavior for v0.1.0 |
| Author | Pim van Pelt <pim@ipng.ch> |
| Last updated | 2026-04-16 |
| Audience | Operators and contributors building the nginx-side observability half of vpp-maglev |
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as described in RFC 2119, and are reserved in this document for requirements that are intended to be enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in lowercase — "can", "will", "does" — and should not be read as normative.
Summary
nginx-vpp-maglev-plugin is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records
per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific vpp-maglev instance whose GRE
tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that
maglevd-frontend, Prometheus, and ad-hoc curl sessions can all read the same data. The module is the nginx-side answer to the open
question in vpp-maglev/docs/design.md about per-backend traffic counters: VPP's lb plugin bypasses
the FIB and cannot produce them, so the backends report what they see.
Background
vpp-maglev programs VPP's lb plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the
deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more maglevd daemons,
decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client.
The design document for vpp-maglev identifies per-AS traffic counters as an explicit open question: VPP's lb fast path bypasses
the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the maglevd-frontend
status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of
it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting.
This project closes that gap from the opposite end. The nginx instances that serve the traffic already observe everything an operator
wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small
in-process module emits those numbers on an HTTP endpoint, and maglevd-frontend fans out to the backends of each frontend and aggregates
the result into the existing status page.
Goals and Non-Goals
Product Goals
- Per-VIP, per-maglev traffic visibility. For each VIP, the module records request count, status-code distribution, bytes in and out, and request-duration histograms, split by which `maglevd` instance delivered the traffic.
- Negligible hot-path cost. At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
- Two readers, one endpoint. A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site using only the `maglevd-frontend` UI can both consume the module without extra configuration.
- Packaging as a dynamic module. The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into stock upstream nginx without recompiling nginx itself.
- Composable with normal nginx use. A host running the module as a maglev backend and serving unrelated direct web traffic on the same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new parameters and directives that are no-ops when unused.
- Graceful reload. An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point of view.
Non-Goals
- The module is not a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or `nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page.
- The module does not terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
- The module does not talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive; nothing more.
- The module does not provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in access logs and existing log-analysis tools.
- The module does not provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on restart they start at zero. Consumers who need historical retention SHOULD read it from Prometheus.
- The module does not own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and nginx master privileges are the operator's responsibility.
Requirements
Each requirement carries a unique identifier (FR-X.Y or NFR-X.Y) so that later sections can cite it.
Functional Requirements
FR-1 Attribution
- FR-1.1 The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a plain listening socket as stock nginx does.
- FR-1.2 The module MUST support a new parameter on the nginx `listen` directive, `source=<tag>`, which attaches a short string tag to the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- FR-1.3 A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see `ipng_stats_default_source`, FR-5.3). The built-in default is the literal string `direct`.
- FR-1.4 A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`.
- FR-1.5 Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in userspace.
- FR-1.6 A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment shape: wildcard fallback plus per-tunnel device-bound listeners.
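For illustration, the four tagging rules compose like this on a single port; the interface and tag names below are hypothetical:

```nginx
listen 80;                                 # no device=, no source=: tagged "direct" (FR-1.3)
listen 80 device=gre-mg1;                  # device= only: tagged "gre-mg1" (FR-1.4)
listen 80 device=gre-mg2 source=mg2;       # explicit source= wins (FR-1.2)
listen [::]:80 device=gre-mg2 source=mg2;  # one line per (family, tunnel) pair (FR-1.5, FR-1.6)
```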
FR-2 Counters
- FR-2.1 The module MUST maintain, for every observed `(source, vip, status_code)` tuple, the following counters: total requests, total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes including status line, headers, and body), and a fixed-bucket histogram of request duration in milliseconds.
- FR-2.2 When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream response time in milliseconds, keyed by the same `(source, vip)` pair.
- FR-2.3 The histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source, vip)` key. The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit `+Inf` bucket. Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level.
- FR-2.4 The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2), not from the request path.
- FR-2.5 The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization; link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key; a VIP that listens on both 80 and 443 MUST be aggregated.
- FR-2.6 The `status_code` dimension MUST be the full three-digit HTTP status code as recorded by nginx at log phase. The module MUST NOT bucket codes into classes (2xx/3xx/4xx/5xx); bucketing is the consumer's job.
FR-3 Scrape endpoint
- FR-3.1 The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that location to serve the module's counters. The directive MUST NOT be combined with other content handlers in the same location.
- FR-3.2 The `ipng_stats` handler MUST support content negotiation via the `Accept` request header: `Accept: application/json` → JSON output; `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
- FR-3.3 The handler MUST support a `source=<tag>` query parameter that filters the output to only counters whose source dimension equals the supplied tag. The comparison is exact-match and case-sensitive.
- FR-3.4 The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
- FR-3.5 Both filters MAY be supplied together; their effect is the intersection.
- FR-3.6 The JSON schema MUST be documented in `docs/scrape-api.md` and MUST be versioned via a top-level `schema` field so that breaking changes can be made additively without bricking existing consumers.
- FR-3.7 The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source` and `vip`. Counter metrics additionally carry a `code` label.
FR-4 Hot path and flush
- FR-4.1 Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path.
- FR-4.2 Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive.
- FR-4.3 The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables.
- FR-4.4 Histogram updates MUST be branch-light: the module MUST precompute a small lookup that maps elapsed milliseconds to a bucket index using binary search over the fixed boundary array, and MUST NOT scan the array linearly (see the sketch after this list).
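As a sketch of the FR-4.4 lookup, in plain C over the FR-2.3 default boundaries (the function and array names are illustrative, not part of the module's API):

```c
#include <stddef.h>
#include <stdint.h>

/* Default boundaries from FR-2.3; index n_bounds is the implicit +Inf bucket. */
static const uint32_t ipng_bounds[] = {
    1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000
};

/* Return the smallest index i with elapsed_ms <= bounds[i], or n_bounds
 * for the +Inf bucket. O(log B) comparisons, never a linear scan. */
static size_t
ipng_hist_bucket(uint32_t elapsed_ms, const uint32_t *bounds, size_t n_bounds)
{
    size_t lo = 0, hi = n_bounds;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (elapsed_ms <= bounds[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }

    return lo;
}
```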
FR-5 Directives
- FR-5.1 `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
- FR-5.2 `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
- FR-5.3 `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=` nor `source=`. Default `direct`.
- FR-5.4 `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be strictly increasing positive integers.
- FR-5.5 `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the `http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module automatically sets `off` for the scrape location).
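Taken together, and with every optional directive spelled out at its documented default, an `http` block using the module might read as follows (the zone name and size are the operator's choice):

```nginx
http {
    ipng_stats_zone           ipng:4m;  # mandatory, no default (FR-5.1)
    ipng_stats_flush_interval 1s;       # default; minimum 100ms (FR-5.2)
    ipng_stats_default_source direct;   # default (FR-5.3)
    ipng_stats_buckets        1 5 10 25 50 100 250 500 1000 2500 5000 10000;  # default (FR-5.4)
    ipng_stats                on;       # default when the module is loaded (FR-5.5)
}
```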
FR-6 Packaging
- FR-6.1 The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding nginx itself.
- FR-6.2 The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming convention used by existing third-party nginx modules packaged for Debian.
- FR-6.3 The package MUST install:
  - `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
  - `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
  - A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf`, created in the package's postinst.
- FR-6.4 The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the `modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
Non-Functional Requirements
NFR-1 Correctness under concurrency
- NFR-1.1 Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker, any handler, or any timer other than the worker's own flush timer.
- NFR-1.2 Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on `memset`, `memcpy`, or any unaligned access for shared-zone updates.
- NFR-1.3 A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see partial flushes across different keys are acceptable, but a single counter MUST never appear to decrease.
- NFR-1.4 Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes `count < sum-of-buckets`. This is achieved by updating the sum/count fields before the bucket counts on the write side, and by a scraper that reads bucket counts before sum/count (see the sketch below).
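The following sketch illustrates that order discipline in C11 terms. It uses a release/acquire pair on the bucket lanes to make the ordering argument portable; on strongly-ordered targets such as x86-64, the relaxed adds mandated by NFR-1.2 compile to the same `lock xadd` instruction. All names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

#define IPNG_NLANES 13            /* 12 default boundaries plus the +Inf lane */

typedef struct {
    _Atomic uint64_t count;                  /* histogram _count lane */
    _Atomic uint64_t buckets[IPNG_NLANES];   /* per-bucket lanes      */
} ipng_hist_t;

/* Flush side: add to the count first, then publish the bucket deltas. */
static void
hist_flush(ipng_hist_t *h, uint64_t dcount, const uint64_t *dbuckets)
{
    atomic_fetch_add_explicit(&h->count, dcount, memory_order_relaxed);

    for (int i = 0; i < IPNG_NLANES; i++) {
        atomic_fetch_add_explicit(&h->buckets[i], dbuckets[i],
                                  memory_order_release);
    }
}

/* Scrape side: read the buckets first, the count last. Any bucket
 * increment the scraper observes was preceded by its count increment,
 * so count >= sum-of-buckets holds for the snapshot. */
static uint64_t
hist_scrape(ipng_hist_t *h, uint64_t *out_buckets)
{
    for (int i = 0; i < IPNG_NLANES; i++) {
        out_buckets[i] = atomic_load_explicit(&h->buckets[i],
                                              memory_order_acquire);
    }

    return atomic_load_explicit(&h->count, memory_order_relaxed);
}
```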
NFR-2 Hot-path cost
- NFR-2.1 The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref (cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and an `O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
- NFR-2.2 The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct `(source, vip, code)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
- NFR-2.3 The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string-format operations, where `K_total` is the number of distinct keys in the zone.
NFR-3 Memory bounds
- NFR-3.1 The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn` level no more than once per minute per worker.
- NFR-3.2 The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT accumulate private state that exceeds the shared-zone capacity.
- NFR-3.3 The set of distinct status codes observed is small (typically ≤ 60) and MUST NOT be allowed to explode due to non-standard responses; the module MUST clamp any observed code `< 100` or `>= 600` into a single bucket labeled `code="unknown"` rather than allocating a new key.
NFR-4 Reload neutrality
- NFR-4.1 `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this transition; counters MUST NOT reset on reload.
- NFR-4.2 New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables lazily from observed traffic, and resume flushing.
- NFR-4.3 The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator renames a tag, new traffic MUST use the new tag.
- NFR-4.4 When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under defunct tags indefinitely. Renames are expected to be rare, and evicting the old entries immediately is acceptable.
NFR-5 Packaging robustness
- NFR-5.1 The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing releases. CI MUST build one `.deb` per supported release and MUST fail if any target breaks.
- NFR-5.2 The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`, no `libjson*`.
- NFR-5.3 A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (this is the purpose of `--with-compat`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the operator to upgrade the nginx package.
NFR-6 Security
- NFR-6.1 The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made in the nginx master process, which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)` themselves.
- NFR-6.2 The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access controls.
- NFR-6.3 The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It logs only aggregate counters and its own operational events.
NFR-7 Documentation
- NFR-7.1 The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the document an operator reads once to get from a freshly-installed package to a working, observable deployment.
- NFR-7.2 The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know where a given knob is allowed to appear.
Architecture Overview
Process Model
The project ships one dynamic nginx module:
- `ngx_http_ipng_stats_module.so` — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process is launched.
There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx.
Data Flow
Requests enter nginx through one of two listener classes:
- Device-bound listeners (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged with a source string `Y`.
- Wildcard fallback listeners (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They are tagged with the configured default source (FR-1.3).
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`.
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their deltas to the shared-memory zone via atomic adds.
The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly and formats the output per the requested content type.
maglevd-frontend fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for maglevd state. It filters server-side via `?source=<its own tag>` so that it only sees the traffic it delivered. The aggregated view is rendered alongside the existing maglev status page.
No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system, does not emit log lines on the request path, and does not speak to any upstream.
Components
The nginx module
ngx_http_ipng_stats_module is the entire technical surface of this project. It is a single C module conforming to nginx's
dynamic-module ABI.
Responsibilities
- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
- Update EWMAs at flush time (FR-2.4).
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
- Honor `ipng_stats on|off` at any config level (FR-5.5).
Attribution Model
The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace inspection. Each maglevd instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one `listen ... device=<ifname> source=<tag>` line per (family, tunnel) pair. The kernel binds that listening socket with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that tunnel. A wildcard `listen 80;` and `listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic, not from maglev.
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel. Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances are fleet-level concerns and change rarely.
The design assumes GRE tunnels used as `device=` sources carry only maglev-originated traffic. Any other traffic arriving on such an interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect.
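For readers unfamiliar with the socket option, the privileged half of this is a single `setsockopt` issued before `bind`, which in nginx happens in the master. A minimal standalone sketch, outside nginx, with error handling reduced to a return code and illustrative names:

```c
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind a wildcard IPv6 listener to one ingress interface. Requires the
 * privilege the nginx master already holds during its bind step
 * (SO_BINDTODEVICE needs CAP_NET_RAW). */
static int
bind_device_listener(int fd, const char *ifname, unsigned short port)
{
    struct sockaddr_in6 sa;

    if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                   ifname, strlen(ifname)) < 0) {
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin6_family = AF_INET6;
    sa.sin6_addr   = in6addr_any;   /* wildcard address, per FR-1.6 */
    sa.sin6_port   = htons(port);

    return bind(fd, (struct sockaddr *) &sa, sizeof(sa));
}
```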
Counter Data Model
Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_code)` where `source_id` and `vip_id` are small integers assigned at first observation and reused thereafter. The value is a fixed-size record containing:
- `requests` (u64)
- `bytes_in` (u64)
- `bytes_out` (u64)
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
- `duration_sum_ms` (u64)
- `upstream_hist` — same shape, only updated when an upstream served the request
- `upstream_sum_ms` (u64)
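Expressed as a C sketch, with the lane count fixed for the default boundaries (the struct name and `IPNG` constant are illustrative):

```c
#include <stdint.h>

#define IPNG_NLANES 13   /* 12 default boundaries plus the +Inf lane (FR-2.3) */

/* One shared-zone record per (source_id, vip_id, status_code) key.
 * Every lane is a u64 so flushes can land as 64-bit atomic adds (NFR-1.2). */
typedef struct {
    uint64_t requests;
    uint64_t bytes_in;
    uint64_t bytes_out;
    uint64_t duration_hist[IPNG_NLANES];
    uint64_t duration_sum_ms;
    uint64_t upstream_hist[IPNG_NLANES];  /* only touched when an upstream served */
    uint64_t upstream_sum_ms;
} ipng_counter_rec_t;
```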
A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. EWMAs are floats but updated only from the flush tick, so there is no float contention on the request path.
The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape endpoint can recover the original strings without re-parsing configuration.
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is `status_code`; ~60 keys per VIP is a typical steady state.
Hot Path
The log-phase handler is deliberately short. Pseudocode:
```c
static ngx_int_t
ipng_stats_log_handler(ngx_http_request_t *r)
{
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_time_t         *tp;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          code_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
    }

    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);

    /* lctx contains source_id and the cached VIP id,
       or resolves VIP lazily on first seen address */

    code_idx = ipng_status_to_index(r->headers_out.status);
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, code_idx);

    counter->requests++;
    counter->bytes_in += r->request_length;
    counter->bytes_out += r->connection->sent;

    /* same computation nginx uses for $request_time */
    tp = ngx_timeofday();
    elapsed_ms = (ngx_msec_int_t) ((tp->sec - r->start_sec) * 1000
                                   + (tp->msec - r->start_msec));
    elapsed_ms = ngx_max(elapsed_ms, 0);

    ipng_hist_add(&counter->duration_hist, elapsed_ms);
    counter->duration_sum_ms += elapsed_ms;

    if (r->upstream_states && r->upstream_states->nelts > 0) {
        ngx_msec_int_t  up_ms = ipng_upstream_total_ms(r);

        ipng_hist_add(&counter->upstream_hist, up_ms);
        counter->upstream_sum_ms += up_ms;
    }

    return NGX_OK;
}
```
Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by `(source_id, vip_id, code_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.
Flush Timer
At the interval configured by `ipng_stats_flush_interval` (default `1s`), the worker:
- Iterates its dirty-slot list (slots touched since the previous flush).
- For each dirty slot, computes the delta versus the last-flushed snapshot stored in the same slot.
- Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane.
- Updates EWMAs from the delta.
- Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again).
The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing.
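In the same pseudocode register as the hot-path sketch above, one flush tick might look like this. The types and field names are illustrative, and the EWMA alpha shown is the conventional `1 - exp(-dt/tau)` choice, which the design text does not pin down:

```c
static void
ipng_flush_tick(ipng_worker_t *w, double dt)
{
    ipng_slot_t     *slot;
    ipng_vip_row_t  *row;

    for (slot = w->dirty_head; slot; slot = slot->dirty_next) {
        /* Delta since the last flush, from the snapshot kept in the slot. */
        uint64_t d_requests = slot->cur.requests - slot->snap.requests;

        atomic_fetch_add_explicit(&slot->shared->requests, d_requests,
                                  memory_order_relaxed);
        /* ... likewise for bytes_in, bytes_out, histogram lanes, sums ... */

        /* Aggregate per (source, vip) so the EWMAs see one rate per VIP row. */
        slot->vip_row->pending += d_requests;

        slot->snap = slot->cur;        /* next tick computes fresh deltas */
    }

    for (row = w->touched_rows; row; row = row->next) {
        double rate = (double) row->pending / dt;

        for (int i = 0; i < 3; i++) {  /* tau = 1s, 10s, 60s (FR-2.4) */
            double alpha = 1.0 - exp(-dt / w->ewma_tau[i]);
            row->ewma[i] += alpha * (rate - row->ewma[i]);
        }
        row->pending = 0;
    }

    w->dirty_head = NULL;              /* clear the dirty list only;
                                          slot state is preserved */
}
```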
Scrape Handler
The `ipng_stats` handler is a leaf content handler. It:
- Parses `?source=` and `?vip=` into exact-match filters.
- Parses `Accept:` to pick the output format.
- Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side briefly).
- Emits each matching key in the chosen format directly into an nginx chain buffer.

Output buffering and sending are standard nginx content-handler code. The handler does not allocate during the walk; it uses a fixed-size buffer per chain link and requests new links only when full.
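For orientation, a scrape in the Prometheus format might emit series along these lines. The `nginx_ipng_` prefix and the `source`/`vip`/`code` labelling are fixed by FR-3.7; the exact metric names and the values shown here are illustrative:

```
nginx_ipng_requests_total{source="gre-mg1",vip="192.0.2.10",code="200"} 41234
nginx_ipng_bytes_in_total{source="gre-mg1",vip="192.0.2.10",code="200"} 9182736
nginx_ipng_bytes_out_total{source="gre-mg1",vip="192.0.2.10",code="200"} 77261842
nginx_ipng_request_duration_ms_bucket{source="gre-mg1",vip="192.0.2.10",le="100"} 40120
nginx_ipng_request_duration_ms_bucket{source="gre-mg1",vip="192.0.2.10",le="+Inf"} 41234
nginx_ipng_request_duration_ms_sum{source="gre-mg1",vip="192.0.2.10"} 513881
nginx_ipng_request_duration_ms_count{source="gre-mg1",vip="192.0.2.10"} 41234
```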
Presents and Consumes
Presents.
- One nginx content handler, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional query parameters.
- Two new `listen` parameters, `device=` and `source=`, usable anywhere a `listen` directive is used.
- Five new directives: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`, `ipng_stats_buckets`, `ipng_stats` (on/off).
- A Prometheus metric family prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`.
Consumes.
- An nginx shared-memory zone declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
- The Linux `SO_BINDTODEVICE` socket option, applied in the nginx master process during bind.
- The nginx log phase and connection structures — standard module embedding, no private kernel calls.
The Debian package
`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula; Debian is the target and upstream nginx on Debian is the platform.
Responsibilities
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3).
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/` (FR-6.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4).
Build
The build is a plain `debian/rules` invocation that:
- Fetches the nginx source for the installed `nginx-dev` version.
- Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree.
- Builds only the module (`make modules`).
- Installs the resulting `.so` into the package tree.
No nginx binary is produced, shipped, or touched. The package is strictly additive.
Presents and Consumes
Presents.
- One Debian package per supported release.
- One dynamic module loadable into stock upstream nginx.
Consumes.
- The target release's `nginx-dev` package at build time.
- The running `nginx` package at install time, for `nginx -t` validation.
Operational Concerns
Deployment Topology
A typical deployment on a single nginx host looks like:
- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and the local Prometheus scraper.
Configuration
A minimal working configuration is about fifteen lines:

```nginx
load_module modules/ngx_http_ipng_stats_module.so;

events {}

http {
    ipng_stats_zone ipng:4m;

    server {
        listen 80;
        listen [::]:80;
        include /etc/nginx/ipng-maglev/listens.conf;
        server_name _;
        # ... normal vhost content
    }

    server {
        listen 9113;
        listen [::]:9113;

        location = /ipng-stats {
            ipng_stats;
            allow 127.0.0.1;
            allow 2001:db8::/48;  # maglev fleet
            deny all;
        }
    }
}
```
`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
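Under the topology above, such an include might read as follows; the tags and interface names are the operator's choice, shown here for four hypothetical maglevs:

```nginx
listen 80 device=gre-mg1 source=mg1;
listen [::]:80 device=gre-mg1 source=mg1;
listen 80 device=gre-mg2 source=mg2;
listen [::]:80 device=gre-mg2 source=mg2;
listen 80 device=gre-mg3 source=mg3;
listen [::]:80 device=gre-mg3 source=mg3;
listen 80 device=gre-mg4 source=mg4;
listen [::]:80 device=gre-mg4 source=mg4;
```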
Nginx Reload Semantics
`nginx -s reload` forks fresh workers, has old workers finish in-flight requests, and then shuts the old workers down. The plugin's shared-memory zone is declared by name, which survives the reload; new workers attach to the same zone and continue accumulating counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).
Source tags are recomputed from the new configuration on reload (NFR-4.3). Renaming a tag in configuration means new traffic appears under the new name; counters under the defunct old name are evicted from the zone on the first flush tick after the reload (NFR-4.4).
Observability of the Plugin Itself
The plugin emits a handful of meta-metrics on the same scrape endpoint:
- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity.
- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full.
- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run.
- `nginx_ipng_flush_duration_seconds` — histogram of flush durations.
- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations.
These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against some other endpoint.
Failure Modes
- Shared zone full. New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected to resize the zone. Existing keys continue updating normally (NFR-3.1).
- Worker crash. The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an observability plane.
- nginx master crash / package upgrade. The shared zone is torn down with the old master. When the new master starts, the zone is recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across restarts.
- Device disappears. If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
- Traffic on a wildcard listener that should have been device-bound. The traffic is counted under `direct` (or the configured default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance is probably missing from the `listen` include.
- Slow scrape on a large zone. Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
- Maglev frontend is down. The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work. When the frontend comes back, it resumes fetching. No state is lost.
Security
- Capabilities. The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during bind; workers never call it (NFR-6.1).
- Scrape access control. The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth; this is deliberate, and documented (NFR-6.2).
- No PII. The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched. Access-log-style observation belongs in nginx's own access log (NFR-6.3).
- Zone sizing as a soft DoS mitigation. Because new keys are dropped when the zone is full rather than allocating unbounded memory, a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics.
Alternatives Considered
- OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`. Rejected. Adds a large runtime dependency just for a narrow feature. The deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of packaging.
- Access log tailing sidecar. Rejected. Decoupled, but introduces a second deploy unit, a log-rotation race, and a synchronization gap between access log truncation and counter accuracy. Also loses live EWMAs.
- `nginx-module-vts`. Considered. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a bigger diff than writing a purpose-built module.
- Attribution via CONNMARK on a single shared GRE tunnel. Rejected after investigation. Netfilter loses the outer GRE source during decapsulation; the outer and inner conntrack entries are independent and the mark does not cross. Even if tagging worked, `SO_MARK` on an accepted socket does not reflect the incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too heavy for a log-phase handler.
- Attribution via multiple GRE tunnels and CONNMARK. Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with kernel primitives nginx already knows about.
- Attribution via eBPF `SO_REUSEPORT` programs. Rejected as dramatic overkill for a problem the kernel already solves for free via socket-lookup specificity.
- Per-VIP enumeration in `listen` directives. Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP address. This makes the generated include file's size independent of the VIP count.
- Pushing counters from the module into `maglevd` over gRPC. Rejected. It inverts the wait-for graph (maglevd's design doc is careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use its existing per-server goroutine model.
- Shipping separate JSON and Prometheus handlers. Rejected. Content negotiation on one handler is simpler to configure and serves both audiences from one ACL.
Decisions Deferred Post-v0.1
- Histogram bucket overrides per `source` or per `vip`. v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves both latency-sensitive (API) and bulk (download) traffic on the same host such that one bucket set is too compromised, making buckets per-`source` or per-`vip` is possible but multiplies memory and complicates Prometheus output.
- TLS handshake metrics. The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until operators ask for it.
- `maglevd-frontend` fetch cadence. Whichever cadence the frontend adopts for traffic counters — the existing ~one-second refresh, or an SSE bridge layered on top — the plugin supports it. The choice is on the frontend side.