Add design doc and Apache 2.0 license for this nginx module

commit c05bcf6aa6
2026-04-16 02:12:56 +02:00
2 changed files with 814 additions and 0 deletions

LICENSE (new file, +201)
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2026 Pim van Pelt <pim@ipng.ch>
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

docs/design.md (new file, +613)
# nginx-vpp-maglev-plugin Design Document
## Metadata
| | |
| --- | --- |
| **Status** | Draft — describes intended behavior for `v0.1.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-16 |
| **Audience** | Operators and contributors building the nginx-side observability half of `vpp-maglev` |
The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase — "can", "will", "does" — and should not be read as normative.
## Summary
`nginx-vpp-maglev-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records
per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific `vpp-maglev` instance whose GRE
tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that
`maglevd-frontend`, Prometheus, and ad-hoc `curl` sessions can all read the same data. The module is the nginx-side answer to the open
question in [`vpp-maglev/docs/design.md`](../../vpp-maglev/docs/design.md) about per-backend traffic counters: VPP's `lb` plugin bypasses
the FIB and cannot produce them, so the backends report what they see.
## Background
`vpp-maglev` programs VPP's `lb` plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the
deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more `maglevd` daemons,
decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client.
The design document for `vpp-maglev` identifies **per-AS traffic counters** as an explicit open question: VPP's `lb` fast path bypasses
the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the `maglevd-frontend`
status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of
it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting.
This project closes that gap from the opposite end. The nginx instances that serve the traffic already observe everything an operator
wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small
in-process module emits those numbers on an HTTP endpoint, and `maglevd-frontend` fans out to the backends of each frontend and aggregates
the result into the existing status page.
## Goals and Non-Goals
### Product Goals
1. **Per-VIP, per-maglev traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and out,
and request-duration histograms, split by which `maglevd` instance delivered the traffic.
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
using only the `maglevd-frontend` UI can both consume the module without extra configuration.
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
stock upstream nginx without recompiling nginx itself.
5. **Composable with normal nginx use.** A host running the module as a maglev backend **and** serving unrelated direct web traffic on the
same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new
parameters and directives that are no-ops when unused.
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
of view.
### Non-Goals
- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
`nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page.
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
- The module does **not** talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not
know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive;
nothing more.
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
access logs and existing log-analysis tools.
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
restart they start at zero. Consumers who need historical retention SHOULD have Prometheus scrape the endpoint and query Prometheus.
- The module does **not** own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and
nginx master privileges are the operator's responsibility.
## Requirements
Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.
### Functional Requirements
**FR-1 Attribution**
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
plain listening socket as stock nginx does.
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `source=<tag>`, which attaches a short string tag to
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- **FR-1.3** A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see
`ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`.
- **FR-1.4** A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`.
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
userspace.
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only
connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment
shape: wildcard fallback plus per-tunnel device-bound listeners.
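The FR-1 deployment shape can be sketched as a single `server` block; the interface names and tags below (`gre-mlb0`, `mlb0`, …) are illustrative, not prescribed by the design:

```nginx
server {
    # Wildcard fallback: anything not matched by a device-bound listener,
    # tagged with the default source, "direct" (FR-1.3).
    listen 80;
    listen [::]:80;

    # One listener per (family, tunnel) pair (FR-1.1, FR-1.2). The kernel's
    # most-specific-match socket lookup dispatches each SYN (FR-1.5).
    listen 80 device=gre-mlb0 source=mlb0;
    listen [::]:80 device=gre-mlb0 source=mlb0;

    # No source=: the tag falls back to the interface name, "gre-mlb1" (FR-1.4).
    listen 80 device=gre-mlb1;
}
```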
**FR-2 Counters**
- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_code)` tuple, the following counters: total requests,
total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes
including status line, headers, and body), and a fixed-bucket histogram of request duration in milliseconds.
- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream
response time in milliseconds, keyed by the same `(source, vip)` pair.
- **FR-2.3** The histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source, vip)` key.
The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit `+Inf` bucket.
Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level.
- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous
request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2),
not from the request path.
- **FR-2.5** The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for
IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization;
link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key;
a VIP that listens on both 80 and 443 MUST be aggregated.
- **FR-2.6** The `status_code` dimension MUST be the full three-digit HTTP status code as recorded by nginx at log phase. The module MUST
NOT bucket codes into classes (2xx/3xx/4xx/5xx); bucketing is the consumer's job.
**FR-3 Scrape endpoint**
- **FR-3.1** The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that
location to serve the module's counters and MUST NOT be combinable with other content handlers in the same location.
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
- `Accept: application/json` → JSON output.
- `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
- **FR-3.3** The handler MUST support a `source=<tag>` query parameter that filters the output to only counters whose source dimension
equals the supplied tag. The comparison is exact-match and case-sensitive.
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST be versioned via a top-level `schema` field so that
breaking changes can be made additively without disrupting existing consumers.
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source`
and `vip`. Counter metrics additionally carry a `code` label.
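Under FR-3.7, a scrape might render along these lines; only the `nginx_ipng_` prefix and the `source`/`vip` labels (plus `code` on counters) are normative — the full metric names and values shown are illustrative:

```
# TYPE nginx_ipng_requests_total counter
nginx_ipng_requests_total{source="mlb0",vip="192.0.2.10",code="200"} 12345
nginx_ipng_requests_total{source="direct",vip="192.0.2.10",code="404"} 7
# TYPE nginx_ipng_request_duration_ms histogram
nginx_ipng_request_duration_ms_bucket{source="mlb0",vip="192.0.2.10",le="25"} 11810
nginx_ipng_request_duration_ms_bucket{source="mlb0",vip="192.0.2.10",le="+Inf"} 12352
nginx_ipng_request_duration_ms_count{source="mlb0",vip="192.0.2.10"} 12352
```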
**FR-4 Hot path and flush**
- **FR-4.1** Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter
table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path.
- **FR-4.2** Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the
shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive.
- **FR-4.3** The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables.
- **FR-4.4** Histogram updates MUST be branch-light: the module MUST map elapsed milliseconds to a bucket index with a binary search
over the fixed, sorted boundary array, and MUST NOT scan the array linearly.
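FR-4.4's lookup is an ordinary lower-bound binary search over the FR-2.3 defaults; this sketch (hypothetical names, not the module's code) returns the bucket whose boundary is the smallest one `>= ms`, with index `nb` standing for the implicit `+Inf` bucket:

```c
#include <stddef.h>
#include <stdint.h>

/* Default boundaries from FR-2.3, in milliseconds. */
static const uint64_t ipng_buckets[] =
    {1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000};

/* Lower-bound binary search: smallest i with ms <= b[i]; nb if none (+Inf). */
static size_t bucket_index(uint64_t ms, const uint64_t *b, size_t nb)
{
    size_t lo = 0, hi = nb;             /* invariant: answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ms <= b[mid]) hi = mid;     /* bucket mid (or lower) admits ms    */
        else lo = mid + 1;
    }
    return lo;
}
```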
**FR-5 Directives**
- **FR-5.1** `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no
default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
nor `source=`. Default `direct`.
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
strictly increasing positive integers.
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
`http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module
automatically sets `off` for the scrape location).
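Pulled together, a minimal `http` context exercising the FR-5 directives might read as follows; the zone name, size, location path, and ACL are illustrative:

```nginx
http {
    ipng_stats_zone stats:4m;            # FR-5.1: mandatory when the module is loaded
    ipng_stats_flush_interval 1s;        # FR-5.2: default shown
    ipng_stats_default_source direct;    # FR-5.3: default shown
    ipng_stats_buckets 1 5 10 25 50 100 250 500 1000 2500 5000 10000;  # FR-5.4

    server {
        listen 80;
        location /ipng-stats {
            ipng_stats;                  # FR-3.1; counting is auto-off here (FR-5.5)
            allow 192.0.2.0/24;          # NFR-6.2: access control is the operator's
            deny all;
        }
    }
}
```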
**FR-6 Packaging**
- **FR-6.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
nginx itself.
- **FR-6.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
convention used by existing third-party nginx modules packaged for Debian.
- **FR-6.3** The package MUST install:
- `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
- `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
- A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
package's postinst.
- **FR-6.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
`modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
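Per FR-6.3, the `modules-available` file is a one-line `load_module` stanza, mirroring how Debian's existing `libnginx-mod-*` packages enable themselves:

```nginx
# /etc/nginx/modules-available/50-mod-http-ipng-stats.conf
load_module modules/ngx_http_ipng_stats_module.so;
```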
### Non-Functional Requirements
**NFR-1 Correctness under concurrency**
- **NFR-1.1** Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker,
any handler, or any timer other than the worker's own flush timer.
- **NFR-1.2** Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on
`memset`, `memcpy`, or any unaligned access for shared-zone updates.
- **NFR-1.3** A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see
partial flushes across different keys are acceptable, but a single counter MUST never appear to decrease.
- **NFR-1.4** Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes
`count > sum-of-buckets`. This is achieved by updating bucket counts before the sum/count and by a scraper that reads sum/count before
bucket counts.
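The flush discipline of NFR-1.1/NFR-1.2 can be sketched with C11 atomics; the lane structs and `flush_lane` are hypothetical, but the relaxed 64-bit `fetch_add` and the skip-untouched-keys behavior (NFR-2.2) are the normative parts:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch (not the module's code): the shared zone holds atomic lanes; each
 * worker keeps a plain, non-atomic private lane it alone touches (NFR-1.1). */
typedef struct { _Atomic uint64_t requests; } shared_lane_t;
typedef struct { uint64_t requests; } private_lane_t;

/* Flush-tick body for one key: publish the private delta with a relaxed
 * atomic add -- readers only need per-counter monotonicity (NFR-1.3) --
 * then reset the delta. Untouched keys are skipped entirely (NFR-2.2). */
static void flush_lane(private_lane_t *priv, shared_lane_t *shm)
{
    if (priv->requests == 0)
        return;
    atomic_fetch_add_explicit(&shm->requests, priv->requests,
                              memory_order_relaxed);
    priv->requests = 0;
}
```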
**NFR-2 Hot-path cost**
- **NFR-2.1** The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref
(cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and a
`O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct
`(source, vip, code)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations,
where `K_total` is the number of distinct keys in the zone.
**NFR-3 Memory bounds**
- **NFR-3.1** The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When
the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn`
level no more than once per minute per worker.
- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT
accumulate private state that exceeds the shared-zone capacity.
- **NFR-3.3** The set of distinct status codes observed is small (typically ≤ 60) and MUST NOT be allowed to explode due to non-standard
responses; the module MUST clamp any observed code `< 100` or `>= 600` into a single bucket labeled `code="unknown"` rather than
allocating a new key.
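NFR-3.3's clamp is two comparisons at key-construction time; the sentinel value `0` standing in for `code="unknown"` is this sketch's assumption, not something the design prescribes:

```c
/* Sketch of NFR-3.3: keep real HTTP codes, collapse everything else into a
 * single sentinel key that the scrape handler renders as code="unknown". */
static int clamp_status(int code)
{
    return (code >= 100 && code < 600) ? code : 0;
}
```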
**NFR-4 Reload neutrality**
- **NFR-4.1** `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this
transition; counters MUST NOT reset on reload.
- **NFR-4.2** New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables
lazily from observed traffic, and resume flushing.
- **NFR-4.3** The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator
renames a tag, new traffic MUST use the new tag.
- **NFR-4.4** When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be
evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under
defunct tags indefinitely. Rename is expected to be rare and evicting the old entries immediately is acceptable.
**NFR-5 Packaging robustness**
- **NFR-5.1** The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing
releases. CI MUST build one `.deb` per supported release and MUST fail if any target breaks.
- **NFR-5.2** The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`,
no `libjson*`.
- **NFR-5.3** A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (this is the
purpose of `--with-compat`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the
operator to upgrade the nginx package.
**NFR-6 Security**
- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made
in the nginx master process, which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)`
themselves.
- **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module
MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
controls.
- **NFR-6.3** The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It
logs only aggregate counters and its own operational events.
**NFR-7 Documentation**
- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the
document an operator reads once to get from a freshly-installed package to a working, observable deployment.
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know
where a given knob is allowed to appear.
## Architecture Overview
### Process Model
The project ships one dynamic nginx module:
- **`ngx_http_ipng_stats_module.so`** — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside
the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process
is launched.
There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx.
### Data Flow
Requests enter nginx through one of two listener classes:
1. **Device-bound listeners** (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged
with a source string `Y`.
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They
are tagged with the configured default source (FR-1.3).
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`.
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
deltas to the shared-memory zone via atomic adds.
The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly
and formats the output per the requested content type.
`maglevd-frontend` fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for
maglevd state. It filters server-side via `?source=<its own tag>` so that it only sees the traffic it delivered. The aggregated view is
rendered alongside the existing maglev status page.
No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system,
does not emit log lines on the request path, and does not speak to any upstream.
## Components
### The nginx module
`ngx_http_ipng_stats_module` is the entire technical surface of this project. It is a single C module conforming to nginx's
dynamic-module ABI.
#### Responsibilities
- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
- Update EWMAs at flush time (FR-2.4).
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
- Honor `ipng_stats on|off` at any config level (FR-5.5).
#### Attribution Model
The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
inspection. Each `maglevd` instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one
`listen ... device=<ifname> source=<tag>` line per `(family, tunnel)` pair. The kernel binds that listening socket with `SO_BINDTODEVICE`,
which causes it to match only connections whose ingress interface is that tunnel. A wildcard `listen 80;` and `listen [::]:80;` pair
provides the fallback for traffic arriving on any other interface — typically normal web traffic, not from maglev.
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel.
Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a
new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances
are fleet-level concerns and change rarely.
The design assumes GRE tunnels used as `device=` sources carry **only** maglev-originated traffic. Any other traffic arriving on such an
interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect.
#### Counter Data Model
Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_code)` where
`source_id` and `vip_id` are small integers assigned at first observation and reused thereafter. The value is a fixed-size record
containing:
- `requests` (u64)
- `bytes_in` (u64)
- `bytes_out` (u64)
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
- `duration_sum_ms` (u64)
- `upstream_hist` — same shape, only updated when an upstream served the request
- `upstream_sum_ms` (u64)
A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. EWMAs are floats but updated
only from the flush tick, so there is no float contention on the request path.
The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape
endpoint can recover the original strings without re-parsing configuration.
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is
`status_code`; ~60 keys per VIP is a typical steady state.
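The fixed-size record might be laid out as follows — a sketch, with `IPNG_HIST_BUCKETS` standing in for the operator-configured bucket count `B`, and all names assumed rather than taken from the module:

```c
#include <assert.h>
#include <stdint.h>

#define IPNG_HIST_BUCKETS  16   /* assumed B; the real value comes from
                                   ipng_stats_buckets */

/* key: small integers interned at first observation */
typedef struct {
    uint16_t  source_id;
    uint16_t  vip_id;
    uint16_t  status_code;
} ipng_key_t;

/* value: all-u64 lanes, so every field is naturally aligned for a
 * 64-bit atomic add at flush time and the record has no padding */
typedef struct {
    uint64_t  requests;
    uint64_t  bytes_in;
    uint64_t  bytes_out;

    uint64_t  duration_hist[IPNG_HIST_BUCKETS + 1];  /* B + the +Inf bucket */
    uint64_t  duration_sum_ms;

    /* only touched when an upstream served the request */
    uint64_t  upstream_hist[IPNG_HIST_BUCKETS + 1];
    uint64_t  upstream_sum_ms;
} ipng_counter_t;
```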
#### Hot Path
The log-phase handler is deliberately short. Pseudocode:
```c
static ngx_int_t
ipng_stats_log_handler(ngx_http_request_t *r)
{
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_time_t         *tp;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          code_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
    }

    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);

    /* lctx contains source_id and the cached VIP id,
       or resolves VIP lazily on first seen address */

    code_idx = ipng_status_to_index(r->headers_out.status);
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, code_idx);

    counter->requests++;
    counter->bytes_in += r->request_length;
    counter->bytes_out += r->connection->sent;

    /* r->start_msec is only the millisecond fraction; combine it with
       r->start_sec the same way nginx computes $request_time */
    tp = ngx_timeofday();
    elapsed_ms = (ngx_msec_int_t) ((tp->sec - r->start_sec) * 1000
                                   + (tp->msec - r->start_msec));
    elapsed_ms = ngx_max(elapsed_ms, 0);

    ipng_hist_add(&counter->duration_hist, elapsed_ms);
    counter->duration_sum_ms += elapsed_ms;

    if (r->upstream_states && r->upstream_states->nelts > 0) {
        ngx_msec_int_t  up_ms = ipng_upstream_total_ms(r);

        ipng_hist_add(&counter->upstream_hist, up_ms);
        counter->upstream_sum_ms += up_ms;
    }

    return NGX_OK;
}
```
Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by
`(source_id, vip_id, code_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.
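As an illustration of that hash key, the three small integers pack into one word. The field widths here (8-bit source, 16-bit VIP, 8-bit status index) are assumptions sized from the capacity discussion above, not the module's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical key packing; widths assume single-digit source tags,
 * low hundreds of VIPs, and ~60 distinct status indices */
static inline uint32_t
ipng_slot_key(uint32_t source_id, uint32_t vip_id, uint32_t code_idx)
{
    return (source_id << 24) | ((vip_id & 0xffff) << 8) | (code_idx & 0xff);
}
```

A single integer key keeps the hot-path probe to one comparison per slot.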
#### Flush Timer
At the interval configured by `ipng_stats_flush_interval` (default 1s), the worker:
1. Iterates its dirty-slot list (slots touched since the previous flush).
2. For each dirty slot, computes the delta versus the last flushed snapshot stored in the same slot.
3. Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane.
4. Updates EWMAs from the delta.
5. Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again).
The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing.
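The delta-and-snapshot dance in steps 2–5 can be sketched for a couple of lanes. Types, field names, and the EWMA placement are assumptions, not the module's real API:

```c
#include <assert.h>
#include <stdint.h>

/* assumed shapes, not the module's real types */
typedef struct {
    uint64_t  requests;
    uint64_t  bytes_out;
} ipng_shm_slot_t;

typedef struct {
    uint64_t  requests;            /* live worker-local lanes */
    uint64_t  bytes_out;
    uint64_t  flushed_requests;    /* snapshot taken at the last flush */
    uint64_t  flushed_bytes_out;
} ipng_worker_slot_t;

/* one dirty slot, one flush tick: compute deltas against the snapshot,
 * apply them with relaxed 64-bit adds, update the EWMA, re-snapshot */
static void
ipng_flush_slot(ipng_shm_slot_t *shm, ipng_worker_slot_t *w,
                double *ewma_rps, double alpha, double dt_sec)
{
    uint64_t  dreq   = w->requests  - w->flushed_requests;
    uint64_t  dbytes = w->bytes_out - w->flushed_bytes_out;

    /* each lane is individually consistent; cross-lane tearing is
       acceptable for an observability plane */
    __atomic_fetch_add(&shm->requests,  dreq,   __ATOMIC_RELAXED);
    __atomic_fetch_add(&shm->bytes_out, dbytes, __ATOMIC_RELAXED);

    /* EWMA is float math, but it only ever runs here, off the hot path */
    *ewma_rps = alpha * ((double) dreq / dt_sec) + (1.0 - alpha) * *ewma_rps;

    w->flushed_requests  = w->requests;
    w->flushed_bytes_out = w->bytes_out;
}
```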
#### Scrape Handler
The `ipng_stats` handler is a leaf content handler. It:
1. Parses `?source=` and `?vip=` into exact-match filters.
2. Parses `Accept:` to pick output format.
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
briefly).
4. Emits each matching key in the chosen format directly into an nginx chain buffer.
Output buffering and sending are standard nginx content handler code. The handler does not allocate during the walk; it uses a
fixed-size buffer per chain link and requests new links only when full.
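For illustration only, a Prometheus-format scrape might emit lines like the following; the metric names are assumptions consistent with the `nginx_ipng_*` prefix, not the module's finalized families:

```
# TYPE nginx_ipng_requests_total counter
nginx_ipng_requests_total{source="mg1",vip="www.example.com",code="200"} 41237
nginx_ipng_requests_total{source="direct",vip="www.example.com",code="404"} 12
# TYPE nginx_ipng_request_duration_seconds histogram
nginx_ipng_request_duration_seconds_bucket{source="mg1",vip="www.example.com",le="0.1"} 40990
nginx_ipng_request_duration_seconds_bucket{source="mg1",vip="www.example.com",le="+Inf"} 41237
```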
#### Presents and Consumes
**Presents.**
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
query parameters.
- **Two new `listen` parameters**, `device=` and `source=`, usable anywhere a `listen` directive is used.
- **Five new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
`ipng_stats_buckets`, `ipng_stats` (on/off).
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`.
**Consumes.**
- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind.
- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.
### The Debian package
`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula;
Debian is the target and upstream nginx on Debian is the platform.
#### Responsibilities
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3).
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
(FR-6.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4).
#### Build
The build is a plain `debian/rules` invocation that:
1. Fetches the nginx source for the installed `nginx-dev` version.
2. Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree.
3. Builds only the module (`make modules`).
4. Installs the resulting `.so` into the package tree.
No nginx binary is produced, shipped, or touched. The package is strictly additive.
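Sketched as a shell recipe — the source-fetch mechanics and the `../module` path are simplified assumptions; a real `debian/rules` would drive the equivalent steps through dh:

```sh
# illustrative only: fetch the matching nginx source, build just the
# module, stage the .so into the package tree
NGINX_VER="$(dpkg-query -f '${Version}' -W nginx-dev | cut -d- -f1)"
apt-get source "nginx=$(dpkg-query -f '${Version}' -W nginx-dev)"

cd "nginx-${NGINX_VER}"
./configure --with-compat --add-dynamic-module=../module
make modules
install -D objs/ngx_http_ipng_stats_module.so \
    ../debian/tmp/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so
```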
#### Presents and Consumes
**Presents.**
- **One Debian package** per supported release.
- **One dynamic module** loadable into stock upstream nginx.
**Consumes.**
- **The target release's `nginx-dev` package** at build time.
- **The running `nginx` package** at install time, for `nginx -t` validation.
## Operational Concerns
### Deployment Topology
A typical deployment on a single nginx host looks like:
- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a
hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and
the local Prometheus scraper.
### Configuration
A minimal working configuration is about twenty lines:
```nginx
load_module modules/ngx_http_ipng_stats_module.so;

http {
    ipng_stats_zone ipng:4m;

    server {
        listen 80;
        listen [::]:80;
        include /etc/nginx/ipng-maglev/listens.conf;
        server_name _;
        # ... normal vhost content
    }

    server {
        listen 127.0.0.1:9113;

        location = /ipng-stats {
            ipng_stats;
            allow 127.0.0.1;
            allow 2001:db8::/48;  # maglev fleet
            deny all;
        }
    }
}
```
`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
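For a four-instance fleet the include file might read as follows (tags and interface names are illustrative, following the `gre-<tag>` convention above):

```nginx
# /etc/nginx/ipng-maglev/listens.conf
listen 80 device=gre-mg1 source=mg1;
listen [::]:80 device=gre-mg1 source=mg1;
listen 80 device=gre-mg2 source=mg2;
listen [::]:80 device=gre-mg2 source=mg2;
listen 80 device=gre-mg3 source=mg3;
listen [::]:80 device=gre-mg3 source=mg3;
listen 80 device=gre-mg4 source=mg4;
listen [::]:80 device=gre-mg4 source=mg4;
```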
### Nginx Reload Semantics
`nginx -s reload` forks fresh workers, lets old workers finish their in-flight requests, and then shuts the old workers down. The
plugin's shared-memory zone is declared by name and survives the reload; new workers attach to the same zone and continue accumulating
counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).
Source tags are recomputed from the new configuration on reload (NFR-4.3). Renaming a tag in configuration means new traffic appears
under the new name; the old name lingers in the zone until either operator restart or an LRU eviction policy ages it out (this is one
of the open questions below).
### Observability of the Plugin Itself
The plugin emits a handful of meta-metrics on the same scrape endpoint:
- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity.
- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full.
- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run.
- `nginx_ipng_flush_duration_seconds` — histogram of flush durations.
- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations.
These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against
some other endpoint.
### Failure Modes
- **Shared zone full.** New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected
to resize the zone. Existing keys continue updating normally (NFR-3.1).
- **Worker crash.** The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the
default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an
observability plane.
- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
restarts.
- **Device disappears.** If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next
reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance
is probably missing from the `listen` include.
- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
- **Maglev frontend is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
When the frontend comes back, it resumes fetching. No state is lost.
### Security
- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during
bind; workers never call it (NFR-6.1).
- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
this is deliberate, and documented (NFR-6.2).
- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
Access-log-style observation belongs in nginx's own access log (NFR-6.3).
- **Zone sizing as a soft DoS mitigation.** Because new keys are dropped when the zone is full rather than allocating unbounded memory,
a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone
exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics.
## Alternatives Considered
- **OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`.** Rejected. Adds a large runtime dependency just for a narrow feature. The
deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of
packaging.
- **Access log tailing sidecar.** Rejected. Decoupled but introduces a second deploy unit, a log-rotation race, and a synchronization
gap between access log truncation and counter accuracy. Also loses live EWMAs.
- **`nginx-module-vts`.** Considered. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress
interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a
bigger diff than writing a purpose-built module.
- **Attribution via CONNMARK on a single shared GRE tunnel.** Rejected after investigation. Netfilter loses the outer GRE source during
decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
heavy for a log-phase handler.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev
tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
kernel primitives nginx already knows about.
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
socket-lookup specificity.
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works
because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
address. This makes the generated include file size independent of the VIP count.
- **Pushing counters from the module into `maglevd` over gRPC.** Rejected. It inverts the wait-for graph (maglevd's design doc is
careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C
module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use
its existing per-server goroutine model.
- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
both audiences from one ACL.
## Decisions Deferred Post-v0.1
- **Histogram bucket overrides per `source` or per `vip`.** v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves
  both latency-sensitive (API) and bulk (download) traffic on the same host, one shared bucket set may be too much of a compromise;
  making buckets per-`source` or per-`vip` is possible but multiplies memory and complicates the Prometheus output.
- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept, so for
  TLS-terminating frontends the handshake time is invisible. Adding a `tls_handshake_duration` histogram is deferred until
  operators ask for it.
- **`maglevd-frontend` fetch cadence.** Whichever cadence the frontend adopts for traffic counters — the existing ~one-second refresh,
or an SSE bridge layered on top — the plugin supports it. The choice is on the frontend side.