commit c05bcf6aa61df0c9b766ed3f7fb0552d939cfd53 Author: Pim van Pelt Date: Thu Apr 16 02:12:56 2026 +0200 Add designdoc and AP2.0 license for this nginx module diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..a81d042 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). 
+ + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. 
Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative 
Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for describing the origin of the Work and + reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2026 Pim van Pelt + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..cb8008e --- /dev/null +++ b/docs/design.md @@ -0,0 +1,613 @@ +# nginx-vpp-maglev-plugin Design Document + +## Metadata + +| | | +| --- | --- | +| **Status** | Draft — describes intended behavior for `v0.1.0` | +| **Author** | Pim van Pelt `` | +| **Last updated** | 2026-04-16 | +| **Audience** | Operators and contributors building the nginx-side observability half of `vpp-maglev` | + +The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in +[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be +enforced in code or by an external dependency. 
Plain-language descriptions of what the system or an operator can do are written in +lowercase — "can", "will", "does" — and should not be read as normative. + +## Summary + +`nginx-vpp-maglev-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records +per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific `vpp-maglev` instance whose GRE +tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that +`maglevd-frontend`, Prometheus, and ad-hoc `curl` sessions can all read the same data. The module is the nginx-side answer to the open +question in [`vpp-maglev/docs/design.md`](../../vpp-maglev/docs/design.md) about per-backend traffic counters: VPP's `lb` plugin bypasses +the FIB and cannot produce them, so the backends report what they see. + +## Background + +`vpp-maglev` programs VPP's `lb` plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the +deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more `maglevd` daemons, +decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client. + +The design document for `vpp-maglev` identifies **per-AS traffic counters** as an explicit open question: VPP's `lb` fast path bypasses +the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the `maglevd-frontend` +status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of +it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting. + +This project closes that gap from the opposite end. 
The nginx instances that serve the traffic already observe everything an operator +wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small +in-process module emits those numbers on an HTTP endpoint, and `maglevd-frontend` fans out to the backends of each frontend and aggregates +the result into the existing status page. + +## Goals and Non-Goals + +### Product Goals + +1. **Per-VIP, per-maglev traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and out, + and request-duration histograms, split by which `maglevd` instance delivered the traffic. +2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of + non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls. +3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site + using only the `maglevd-frontend` UI can both consume the module without extra configuration. +4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into + stock upstream nginx without recompiling nginx itself. +5. **Composable with normal nginx use.** A host running the module as a maglev backend **and** serving unrelated direct web traffic on the + same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new + parameters and directives that are no-ops when unused. +6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point + of view. + +### Non-Goals + +- The module is **not** a generic nginx metrics exporter. 
It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or + `nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page. +- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only. +- The module does **not** talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not + know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive; + nothing more. +- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in + access logs and existing log-analysis tools. +- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on + restart they start at zero. Consumers who need historical retention SHOULD scrape the endpoint with Prometheus and rely on its retention. +- The module does **not** own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and + nginx master privileges are the operator's responsibility. + +## Requirements + +Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it. + +### Functional Requirements + +**FR-1 Attribution** + +- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=`, which causes the resulting + listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A `listen` directive without `device=` MUST create a + plain listening socket as stock nginx does. +- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `source=`, which attaches a short string tag to + the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener. 
+- **FR-1.3** A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see + `ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`. +- **FR-1.4** A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`. +- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup + rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in + userspace. +- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=` MUST accept only + connections whose ingress interface is ``, for any local address served through that interface. This is the intended deployment + shape: wildcard fallback plus per-tunnel device-bound listeners. + +**FR-2 Counters** + +- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_code)` tuple, the following counters: total requests, + total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes + including status line, headers, and body), and a fixed-bucket histogram of request duration in milliseconds. +- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream + response time in milliseconds, keyed by the same `(source, vip)` pair. +- **FR-2.3** The histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source, vip)` key. + The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit `+Inf` bucket. + Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level. 
+- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous + request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2), + not from the request path. +- **FR-2.5** The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for + IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization; + link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key; + a VIP that listens on both 80 and 443 MUST be aggregated. +- **FR-2.6** The `status_code` dimension MUST be the full three-digit HTTP status code as recorded by nginx at log phase. The module MUST + NOT bucket codes into classes (2xx/3xx/4xx/5xx); bucketing is the consumer's job. + +**FR-3 Scrape endpoint** + +- **FR-3.1** The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that + location to serve the module's counters and MUST NOT be combinable with other content handlers in the same location. +- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header: + - `Accept: application/json` → JSON output. + - `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format. +- **FR-3.3** The handler MUST support a `source=` query parameter that filters the output to only counters whose source dimension + equals the supplied tag. The comparison is exact-match and case-sensitive. +- **FR-3.4** The handler MUST support a `vip=
` query parameter that filters the output to only counters whose VIP dimension + equals the supplied address. The comparison uses the canonicalized form of FR-2.5. +- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection. +- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST version via a top-level `schema` field so that breaking + changes can be made additively without bricking existing consumers. +- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source` + and `vip`. Counter metrics additionally carry a `code` label. + +**FR-4 Hot path and flush** + +- **FR-4.1** Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter + table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path. +- **FR-4.2** Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the + shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive. +- **FR-4.3** The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables. +- **FR-4.4** Histogram updates MUST be branch-light: the module MUST precompute a small lookup that maps elapsed milliseconds to a bucket + index using binary search over the fixed boundary array, and MUST NOT scan the array linearly. + +**FR-5 Directives** + +- **FR-5.1** `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no + default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded. +- **FR-5.2** `ipng_stats_flush_interval ` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`. 
+- **FR-5.3** `ipng_stats_default_source ` at the `http` level sets the tag applied to listening sockets that have neither `device=` + nor `source=`. Default `direct`. +- **FR-5.4** `ipng_stats_buckets ` at the `http` level overrides the default histogram bucket boundaries. Values MUST be + strictly increasing positive integers. +- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the + `http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module + automatically sets `off` for the scrape location). + +**FR-6 Packaging** + +- **FR-6.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev + headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding + nginx itself. +- **FR-6.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming + convention used by existing third-party nginx modules packaged for Debian. +- **FR-6.3** The package MUST install: + - `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so` + - `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive. + - A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the + package's postinst. +- **FR-6.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the + `modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start. 
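The bucket lookup of FR-4.4 and the boundary validation of FR-5.4 can be sketched in a few lines of C. This is a minimal illustration, not the module's code: the function and array names are hypothetical, and the choice of inclusive upper bounds (Prometheus-style `le` buckets) is an assumption that the requirements above do not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default boundaries from FR-2.3, in milliseconds; +Inf is implicit. */
static const uint64_t ipng_default_bounds[] = {
    1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000
};
#define IPNG_NBOUNDS \
    (sizeof(ipng_default_bounds) / sizeof(ipng_default_bounds[0]))

/* FR-5.4: boundaries must be strictly increasing positive integers.
 * Returns 1 when the array is acceptable, 0 otherwise. */
static int
ipng_bounds_valid(const uint64_t *bounds, size_t n)
{
    if (n == 0 || bounds[0] == 0) {
        return 0;
    }
    for (size_t i = 1; i < n; i++) {
        if (bounds[i] <= bounds[i - 1]) {
            return 0;
        }
    }
    return 1;
}

/* Binary search for the first boundary >= elapsed_ms.
 * Assumption: buckets are inclusive upper bounds ("le").
 * A return value of n selects the implicit +Inf bucket. */
static size_t
ipng_bucket_index(const uint64_t *bounds, size_t n, uint64_t elapsed_ms)
{
    size_t lo = 0, hi = n;

    assert(bounds != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] < elapsed_ms) {
            lo = mid + 1;   /* boundary too small, look right */
        } else {
            hi = mid;       /* boundary fits, look for a smaller one */
        }
    }
    return lo;
}
```

With the twelve default boundaries the search takes at most four comparisons per request, which is the `O(log B)` cost the hot path is allowed to pay.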
+ +### Non-Functional Requirements + +**NFR-1 Correctness under concurrency** + +- **NFR-1.1** Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker, + any handler, or any timer other than the worker's own flush timer. +- **NFR-1.2** Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on + `memset`, `memcpy`, or any unaligned access for shared-zone updates. +- **NFR-1.3** A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see + partial flushes across different keys are acceptable, but a single counter MUST NOT appear to decrease. +- **NFR-1.4** Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes + `count > sum-of-buckets`. This is achieved by updating bucket counts before the sum/count and by a scraper that reads sum/count before + bucket counts, so every bucket increment that precedes an observed count is already in place when the buckets are read. + +**NFR-2 Hot-path cost** + +- **NFR-2.1** The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref + (cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and an + `O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks. +- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct + `(source, vip, code)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited. +- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations, + where `K_total` is the number of distinct keys in the zone. 
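The write and read ordering described in NFR-1.4 can be sketched with C11 atomics. The types and function names below are hypothetical, and the sketch makes one deliberate assumption beyond NFR-1.2: the final `count` update uses release ordering and the scraper's `count` read uses acquire ordering, because purely relaxed operations do not by themselves guarantee that another process sees the bucket updates first. A real module would more likely use nginx's own atomic wrappers such as `ngx_atomic_fetch_add`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define IPNG_LANES 13 /* 12 default boundaries plus +Inf (FR-2.3) */

/* Histogram as it would live in the shared-memory zone. */
typedef struct {
    _Atomic uint64_t buckets[IPNG_LANES];
    _Atomic uint64_t sum_ms;
    _Atomic uint64_t count;
} ipng_shared_hist_t;

/* Worker-private delta, also reused as a plain snapshot buffer. */
typedef struct {
    uint64_t buckets[IPNG_LANES];
    uint64_t sum_ms;
    uint64_t count;
} ipng_hist_delta_t;

/* Writer side: buckets first, then sum, then count (published last). */
static void
ipng_hist_flush(ipng_shared_hist_t *shared, ipng_hist_delta_t *delta)
{
    assert(shared != NULL && delta != NULL);
    if (delta->count == 0) {
        return; /* nothing observed since the last flush */
    }
    for (size_t i = 0; i < IPNG_LANES; i++) {
        if (delta->buckets[i] != 0) {
            atomic_fetch_add_explicit(&shared->buckets[i], delta->buckets[i],
                                      memory_order_relaxed);
            delta->buckets[i] = 0;
        }
    }
    atomic_fetch_add_explicit(&shared->sum_ms, delta->sum_ms,
                              memory_order_relaxed);
    /* Assumption: release here, so the bucket adds are visible first. */
    atomic_fetch_add_explicit(&shared->count, delta->count,
                              memory_order_release);
    delta->sum_ms = 0;
    delta->count = 0;
}

/* Reader side: count and sum first, buckets afterwards. */
static void
ipng_hist_snapshot(ipng_shared_hist_t *shared, ipng_hist_delta_t *out)
{
    out->count = atomic_load_explicit(&shared->count, memory_order_acquire);
    out->sum_ms = atomic_load_explicit(&shared->sum_ms, memory_order_relaxed);
    for (size_t i = 0; i < IPNG_LANES; i++) {
        out->buckets[i] = atomic_load_explicit(&shared->buckets[i],
                                               memory_order_relaxed);
    }
}
```

Under these assumptions a snapshot's `count` is never larger than the sum of its bucket lanes: the buckets may run slightly ahead of the count during a racing flush, but never behind it.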
+ +**NFR-3 Memory bounds** + +- **NFR-3.1** The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When + the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn` + level no more than once per minute per worker. +- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT + accumulate private state that exceeds the shared-zone capacity. +- **NFR-3.3** The set of distinct status codes observed is small (typically ≤ 60) and MUST NOT be allowed to explode due to non-standard + responses; the module MUST clamp any observed code `< 100` or `>= 600` into a single bucket labeled `code="unknown"` rather than + allocating a new key. + +**NFR-4 Reload neutrality** + +- **NFR-4.1** `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this + transition; counters MUST NOT reset on reload. +- **NFR-4.2** New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables + lazily from observed traffic, and resume flushing. +- **NFR-4.3** The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator + renames a tag, new traffic MUST use the new tag. +- **NFR-4.4** When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be + evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under + defunct tags indefinitely. Rename is expected to be rare and evicting the old entries immediately is acceptable. + +**NFR-5 Packaging robustness** + +- **NFR-5.1** The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing + releases. 
CI MUST build one `.deb` per supported release and MUST fail if any target breaks. +- **NFR-5.2** The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`, + no `libjson*`. +- **NFR-5.3** A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (this is the + purpose of `--with-compat`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the + operator to upgrade the nginx package. + +**NFR-6 Security** + +- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made + in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)` + themselves. +- **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module + MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access + controls. +- **NFR-6.3** The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It + logs only aggregate counters and its own operational events. + +**NFR-7 Documentation** + +- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the + module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are + flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the + document an operator reads once to get from a freshly-installed package to a working, observable deployment. 
+- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the + module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed + values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know + where a given knob is allowed to appear. + +## Architecture Overview + +### Process Model + +The project ships one dynamic nginx module: + +- **`ngx_http_ipng_stats_module.so`** — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside + the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process + is launched. + +There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx. + +### Data Flow + +Requests enter nginx through one of two listener classes: + +1. **Device-bound listeners** (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged + with a source string `Y`. +2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They + are tagged with the configured default source (FR-1.3). + +During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log +phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`. + +A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their +deltas to the shared-memory zone via atomic adds. 
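The log-phase increment and the dirty-key flush described above can be sketched as a small worker-private table. Everything here is illustrative: the names, the 256-slot capacity, the packed 32-bit key layout, and the callback used to hand deltas to the shared zone are assumptions made for the example, not details taken from the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPNG_SLOTS 256u /* hypothetical capacity; the hash below assumes 256 */

typedef struct {
    uint32_t key;       /* packed (source_id, vip_id, code); 0 means empty */
    uint8_t  dirty;     /* set once the slot is on the dirty list */
    uint64_t requests;
    uint64_t bytes_in;
    uint64_t bytes_out;
} ipng_slot_t;

typedef struct {
    ipng_slot_t slots[IPNG_SLOTS];
    uint16_t    dirty_idx[IPNG_SLOTS];
    size_t      ndirty;
} ipng_worker_table_t;

typedef void (*ipng_apply_fn)(const ipng_slot_t *delta, void *ctx);

/* NFR-3.3: codes outside 100..599 collapse into one "unknown" value (0).
 * Assumes source_id < 4095 and vip_id < 1024 for this toy key layout. */
static uint32_t
ipng_pack_key(uint32_t source_id, uint32_t vip_id, unsigned status)
{
    uint32_t code = (status < 100 || status >= 600) ? 0 : (uint32_t) status;
    return ((source_id + 1u) << 20) | ((vip_id & 0x3ffu) << 10) | code;
}

/* Log-phase update: plain increments on worker-private memory only. */
static int
ipng_record(ipng_worker_table_t *t, uint32_t source_id, uint32_t vip_id,
            unsigned status, uint64_t in, uint64_t out)
{
    uint32_t key = ipng_pack_key(source_id, vip_id, status);
    size_t   i = ((uint32_t) (key * 2654435761u)) >> 24; /* 0..255 */

    for (size_t probes = 0; probes < IPNG_SLOTS; probes++) {
        ipng_slot_t *s = &t->slots[i];
        if (s->key == 0) {
            s->key = key; /* claim an empty slot for this key */
        }
        if (s->key == key) {
            s->requests++;
            s->bytes_in += in;
            s->bytes_out += out;
            if (!s->dirty) {
                s->dirty = 1;
                t->dirty_idx[t->ndirty++] = (uint16_t) i;
            }
            return 0;
        }
        i = (i + 1u) & (IPNG_SLOTS - 1u); /* linear probing */
    }
    return -1; /* table full: the caller drops the sample */
}

/* Flush tick: visit only dirty slots, hand each delta to `apply`, reset. */
static size_t
ipng_flush(ipng_worker_table_t *t, ipng_apply_fn apply, void *ctx)
{
    size_t visited = t->ndirty;

    for (size_t n = 0; n < t->ndirty; n++) {
        ipng_slot_t *s = &t->slots[t->dirty_idx[n]];
        apply(s, ctx);
        s->requests = s->bytes_in = s->bytes_out = 0;
        s->dirty = 0;
    }
    t->ndirty = 0;
    return visited;
}

/* Example sink: adds flushed request deltas into a uint64_t total. */
static void
ipng_sum_requests(const ipng_slot_t *delta, void *ctx)
{
    *(uint64_t *) ctx += delta->requests;
}
```

Keeping a separate dirty list is what lets the flush cost track the number of keys touched in the interval rather than the table size, in line with NFR-2.2. In the real module the `apply` step would perform the atomic adds into the shared zone.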
+ +The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly +and formats the output per the requested content type. + +`maglevd-frontend` fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for +maglevd state. It filters server-side via `?source=` so that it only sees the traffic it delivered. The aggregated view is +rendered alongside the existing maglev status page. + +No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system, +does not emit log lines on the request path, and does not speak to any upstream. + +## Components + +### The nginx module + +`ngx_http_ipng_stats_module` is the entire technical surface of this project. It is a single C module conforming to nginx's +dynamic-module ABI. + +#### Responsibilities + +- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2). +- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1). +- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1). +- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2). +- Update EWMAs at flush time (FR-2.4). +- Serve the scrape endpoint with content negotiation and optional filters (FR-3). +- Honor `ipng_stats on|off` at any config level (FR-5.5). + +#### Attribution Model + +The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace +inspection. Each `maglevd` instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one +`listen ... device= source=` line per `(family, tunnel)` pair. 
The kernel binds that listening socket with `SO_BINDTODEVICE`, +which causes it to match only connections whose ingress interface is that tunnel. A wildcard `listen 80;` and `listen [::]:80;` pair +provides the fallback for traffic arriving on any other interface — typically normal web traffic, not from maglev. + +The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback +and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to. + +Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel. +Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a +new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances +are fleet-level concerns and change rarely. + +The design assumes GRE tunnels used as `device=` sources carry **only** maglev-originated traffic. Any other traffic arriving on such an +interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect. + +#### Counter Data Model + +Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_code)` where +`source_id` and `vip_id` are small integers assigned at first observation and reused thereafter. The value is a fixed-size record +containing: + +- `requests` (u64) +- `bytes_in` (u64) +- `bytes_out` (u64) +- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket) +- `duration_sum_ms` (u64) +- `upstream_hist` — same shape, only updated when an upstream served the request +- `upstream_sum_ms` (u64) + +A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. 
EWMAs are floats but updated
+only from the flush tick, so there is no float contention on the request path.
+
+The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape
+endpoint can recover the original strings without re-parsing configuration.
+
+String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
+counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
+low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is
+`status_code`; ~60 keys per VIP is a typical steady state.
+
+#### Hot Path
+
+The log-phase handler is deliberately short. Pseudocode:
+
+```c
+static ngx_int_t
+ipng_stats_log_handler(ngx_http_request_t *r)
+{
+    ipng_listen_ctx_t  *lctx;
+    ipng_counter_t     *counter;
+    ngx_time_t         *tp;
+    ngx_msec_int_t      elapsed_ms;
+    ngx_uint_t          code_idx;
+
+    if (!ipng_stats_enabled(r)) {
+        return NGX_OK;
+    }
+
+    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);
+    /* lctx contains source_id and the cached VIP id,
+       or resolves VIP lazily on first seen address */
+
+    code_idx = ipng_status_to_index(r->headers_out.status);
+    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, code_idx);
+
+    counter->requests++;
+    counter->bytes_in += r->request_length;
+    counter->bytes_out += r->connection->sent;
+
+    /* r->start_sec / r->start_msec hold the wall-clock start time (seconds and
+       millisecond part), so elapsed time is computed against ngx_timeofday(),
+       the same way ngx_http_log_module derives $request_time. */
+    tp = ngx_timeofday();
+    elapsed_ms = (ngx_msec_int_t)
+                     ((tp->sec - r->start_sec) * 1000 + (tp->msec - r->start_msec));
+    elapsed_ms = ngx_max(elapsed_ms, 0);   /* guard against clock steps */
+    ipng_hist_add(&counter->duration_hist, elapsed_ms);
+    counter->duration_sum_ms += elapsed_ms;
+
+    if (r->upstream_states && r->upstream_states->nelts > 0) {
+        ngx_msec_int_t up_ms = ipng_upstream_total_ms(r);
+        ipng_hist_add(&counter->upstream_hist, up_ms);
+        counter->upstream_sum_ms += up_ms;
+    }
+
+    return NGX_OK;
+}
+```
+
+Nothing here touches shared memory.
`ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by +`(source_id, vip_id, code_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID. + +#### Flush Timer + +At the interval configured by `ipng_stats_flush_interval` (default 1s), the worker: + +1. Iterates its dirty-slot list (slots touched since the previous flush). +2. For each dirty slot, computes the delta versus the last flushed snapshot stored in the same slot. +3. Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane. +4. Updates EWMAs from the delta. +5. Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again). + +The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing. + +#### Scrape Handler + +The `ipng_stats` handler is a leaf content handler. It: + +1. Parses `?source=` and `?vip=` into exact-match filters. +2. Parses `Accept:` to pick output format. +3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side + briefly). +4. Emits each matching key in the chosen format directly into an nginx chain buffer. + +Output buffering and sending are standard nginx content handler code. The handler does not allocate during the walk; it uses a +fixed-size buffer per chain link and requests new links only when full. + +#### Presents and Consumes + +**Presents.** + +- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional + query parameters. +- **Two new `listen` parameters**, `device=` and `source=`, usable anywhere a `listen` directive is used. +- **Five new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`, + `ipng_stats_buckets`, `ipng_stats` (on/off). 
+- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`. + +**Consumes.** + +- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool. +- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind. +- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls. + +### The Debian package + +`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula; +Debian is the target and upstream nginx on Debian is the platform. + +#### Responsibilities + +- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3). +- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3). +- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/` + (FR-6.3). +- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4). + +#### Build + +The build is a plain `debian/rules` invocation that: + +1. Fetches the nginx source for the installed `nginx-dev` version. +2. Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree. +3. Builds only the module (`make modules`). +4. Installs the resulting `.so` into the package tree. + +No nginx binary is produced, shipped, or touched. The package is strictly additive. + +#### Presents and Consumes + +**Presents.** + +- **One Debian package** per supported release. +- **One dynamic module** loadable into stock upstream nginx. + +**Consumes.** + +- **The target release's `nginx-dev` package** at build time. +- **The running `nginx` package** at install time, for `nginx -t` validation. 
+
+## Operational Concerns
+
+### Deployment Topology
+
+A typical deployment on a single nginx host looks like:
+
+- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a
+  hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-` (e.g. `gre-mg1`, `gre-mg2`).
+- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
+- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
+- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
+- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and
+  the local Prometheus scraper.
+
+### Configuration
+
+A minimal working configuration is about twenty lines:
+
+```nginx
+load_module modules/ngx_http_ipng_stats_module.so;
+
+http {
+    ipng_stats_zone ipng:4m;
+
+    server {
+        listen 80;
+        listen [::]:80;
+        include /etc/nginx/ipng-maglev/listens.conf;
+
+        server_name _;
+        # ... normal vhost content
+    }
+
+    server {
+        listen 127.0.0.1:9113;
+        location = /ipng-stats {
+            ipng_stats;
+            allow 127.0.0.1;
+            allow 2001:db8::/48;  # maglev fleet
+            deny all;
+        }
+    }
+}
+```
+
+`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
+
+### Nginx Reload Semantics
+
+`nginx -s reload` forks fresh workers, has old workers finish in-flight requests, and then shuts the old workers down. The plugin's
+shared-memory zone is declared by name, which survives the reload; new workers attach to the same zone and continue accumulating
+counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).
+
+Source tags are recomputed from the new configuration on reload (NFR-4.3).
Renaming a tag in configuration means new traffic appears +under the new name; the old name lingers in the zone until either operator restart or an LRU eviction policy ages it out (this is one +of the open questions below). + +### Observability of the Plugin Itself + +The plugin emits a handful of meta-metrics on the same scrape endpoint: + +- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity. +- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full. +- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run. +- `nginx_ipng_flush_duration_seconds` — histogram of flush durations. +- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations. + +These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against +some other endpoint. + +### Failure Modes + +- **Shared zone full.** New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected + to resize the zone. Existing keys continue updating normally (NFR-3.1). +- **Worker crash.** The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the + default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an + observability plane. +- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is + recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across + restarts. +- **Device disappears.** If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next + reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer. 
+- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured + default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance + is probably missing from the `listen` include. +- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the + operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime. +- **Maglev frontend is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work. + When the frontend comes back, it resumes fetching. No state is lost. + +### Security + +- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during + bind; workers never call it (NFR-6.1). +- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth; + this is deliberate, and documented (NFR-6.2). +- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched. + Access-log-style observation belongs in nginx's own access log (NFR-6.3). +- **Zone sizing as a soft DoS mitigation.** Because new keys are dropped when the zone is full rather than allocating unbounded memory, + a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone + exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics. + +## Alternatives Considered + +- **OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`.** Rejected. Adds a large runtime dependency just for a narrow feature. 
The + deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of + packaging. +- **Access log tailing sidecar.** Rejected. Decoupled but introduces a second deploy unit, a log-rotation race, and a synchronization + gap between access log truncation and counter accuracy. Also loses live EWMAs. +- **`nginx-module-vts`.** Considered. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress + interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a + bigger diff than writing a purpose-built module. +- **Attribution via CONNMARK on a single shared GRE tunnel.** Rejected after investigation. Netfilter loses the outer GRE source during + decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an + accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too + heavy for a log-phase handler. +- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev + tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with + kernel primitives nginx already knows about. +- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via + socket-lookup specificity. +- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works + because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP + address. This makes the generated include file size independent of the VIP count. 
+- **Pushing counters from the module into `maglevd` over gRPC.** Rejected. It inverts the wait-for graph (maglevd's design doc is + careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C + module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use + its existing per-server goroutine model. +- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves + both audiences from one ACL. + +## Decisions Deferred Post-v0.1 + +- **Histogram bucket overrides per `source` or per `vip`.** v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves + both latency-sensitive (API) and bulk (download) traffic on the same host such that one bucket set is too compromised, making buckets + per-`source` or per-`vip` is possible but multiplies memory and complicates Prometheus output. +- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For + TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until + operators ask for it. +- **`maglevd-frontend` fetch cadence.** Whichever cadence the frontend adopts for traffic counters — the existing ~one-second refresh, + or an SSE bridge layered on top — the plugin supports it. The choice is on the frontend side.