Add ngx_http_ipng_stats_module: per-VIP, per-device traffic counters
Full implementation of the nginx dynamic module with:

- SO_BINDTODEVICE-based per-interface traffic attribution
- Per-worker lock-free counters flushed to shared memory
- Prometheus text and JSON scrape endpoint at configurable location
- UDP-only global logtail (ipng_stats_logtail) for fire-and-forget access log streaming
- $ipng_source_tag nginx variable for use in log_format/map
- Histogram buckets, EWMA rate gauges, zone meta-metrics
- Debian packaging (libnginx-mod-http-ipng-stats)
- Robot Framework end-to-end tests via containerlab
- SPDX Apache-2.0 headers on all source files
213
docs/design.md
@@ -1,4 +1,5 @@
# nginx-vpp-maglev-plugin Design Document
<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-ipng-stats-plugin Design Document

## Metadata

@@ -7,7 +8,7 @@
| **Status** | Draft — describes intended behavior for `v0.1.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-16 |
| **Audience** | Operators and contributors building the nginx-side observability half of `vpp-maglev` |
| **Audience** | Operators and contributors deploying per-device, per-VIP traffic observability on nginx |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
@@ -16,60 +17,52 @@ lowercase — "can", "will", "does" — and should not be read as normative.
|
||||
|
||||
## Summary
|
||||
|
||||
`nginx-vpp-maglev-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module records
|
||||
per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific `vpp-maglev` instance whose GRE
|
||||
tunnel delivered each connection. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that
|
||||
`maglevd-frontend`, Prometheus, and ad-hoc `curl` sessions can all read the same data. The module is the nginx-side answer to the open
|
||||
question in [`vpp-maglev/docs/design.md`](../../vpp-maglev/docs/design.md) about per-backend traffic counters: VPP's `lb` plugin bypasses
|
||||
the FIB and cannot produce them, so the backends report what they see.
|
||||
`nginx-ipng-stats-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module
|
||||
records per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific interface on which each
|
||||
connection arrived. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that Prometheus, custom
|
||||
dashboards, and ad-hoc `curl` sessions can all read the same data.
|
||||
|
||||
## Background
|
||||
|
||||
`vpp-maglev` programs VPP's `lb` plugin so that traffic hashed to a VIP lands on a pool of healthy Application Servers (ASes). For the
|
||||
deployment this module targets, every AS is an nginx instance receiving GRE-encapsulated traffic from one or more `maglevd` daemons,
|
||||
decapsulating it, and terminating or proxying HTTP and HTTPS as it would for any other inbound client.
|
||||
Any deployment where traffic arrives on distinct Linux interfaces — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet — can
|
||||
benefit from per-interface traffic visibility. The nginx instances that serve the traffic already observe everything an operator wants to
|
||||
see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small in-process
|
||||
module emits those numbers on an HTTP endpoint, and consumers scrape the data filtered by source tag.
|
||||
|
||||
The design document for `vpp-maglev` identifies **per-AS traffic counters** as an explicit open question: VPP's `lb` fast path bypasses
|
||||
the FIB, so VPP exposes per-VIP counters in the stats segment but not per-backend ones. An operator looking at the `maglevd-frontend`
|
||||
status page for a frontend with four backends can see the frontend's aggregate packet rate but not which backend is carrying how much of
|
||||
it, which errors are concentrated on which backend, or whether one backend's p95 latency is drifting.
|
||||
|
||||
This project closes that gap from the opposite end. The nginx instances that serve the traffic already observe everything an operator
|
||||
wants to see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small
|
||||
in-process module emits those numbers on an HTTP endpoint, and `maglevd-frontend` fans out to the backends of each frontend and aggregates
|
||||
the result into the existing status page.
|
||||
One motivating use case is [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), where each load-balancer instance terminates a GRE
|
||||
tunnel on the nginx host. The module attributes traffic per tunnel, letting the frontend show per-backend counters that VPP's fast path
|
||||
cannot provide. But the module is not coupled to that use case — it works with any interface type and any consumer.
|
||||
|
||||
## Goals and Non-Goals
|
||||
|
||||
### Product Goals
|
||||
|
||||
1. **Per-VIP, per-maglev traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and out,
|
||||
and request-duration histograms, split by which `maglevd` instance delivered the traffic.
|
||||
1. **Per-VIP, per-device traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and
|
||||
out, and request-duration histograms, split by which interface delivered the traffic.
|
||||
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
|
||||
non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
|
||||
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
|
||||
using only the `maglevd-frontend` UI can both consume the module without extra configuration.
|
||||
using a custom consumer can both consume the module without extra configuration.
|
||||
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
|
||||
stock upstream nginx without recompiling nginx itself.
|
||||
5. **Composable with normal nginx use.** A host running the module as a maglev backend **and** serving unrelated direct web traffic on the
|
||||
same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive; it only adds new
|
||||
parameters and directives that are no-ops when unused.
|
||||
5. **Composable with normal nginx use.** A host running the module with device-bound listeners **and** serving unrelated direct web
|
||||
traffic on the same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive;
|
||||
it only adds new parameters and directives that are no-ops when unused.
|
||||
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
|
||||
of view.
|
||||
|
||||
### Non-Goals
|
||||
|
||||
- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
|
||||
`nginx-lua-prometheus`. Its metric set is deliberately narrow and shaped by the `maglevd-frontend` status page.
|
||||
`nginx-lua-prometheus`. Its metric set is deliberately narrow: per-VIP, per-device counters, histograms, and rate gauges.
|
||||
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
|
||||
- The module does **not** talk to `maglevd` directly. It does not initiate gRPC, it does not read maglev configuration, and it does not
|
||||
know which maglev instance owns which VIP. The attribution tag it emits is a string supplied by the operator in the `listen` directive;
|
||||
nothing more.
|
||||
- The module does **not** talk to any external daemon. It does not initiate gRPC or read any external configuration. The attribution tag
|
||||
it emits is a string supplied by the operator in the `listen` directive; nothing more.
|
||||
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
|
||||
access logs and existing log-analysis tools.
|
||||
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
|
||||
restart they start at zero. Consumers who need historical retention SHOULD rely on Prometheus, which retains scraped samples across restarts.
|
||||
- The module does **not** own the GRE tunnels, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Tunnel creation, VIP binding, and
|
||||
- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and
|
||||
nginx master privileges are the operator's responsibility.
|
||||
|
||||
## Requirements
|
||||
@@ -83,11 +76,11 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
|
||||
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
|
||||
listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
|
||||
plain listening socket as stock nginx does.
|
||||
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `source=<tag>`, which attaches a short string tag to
|
||||
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
|
||||
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
|
||||
- **FR-1.3** A listening socket with neither `device=` nor `source=` MUST be tagged with the configured default source string (see
|
||||
- **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
|
||||
`ipng_stats_default_source`, FR-5.3). When that directive is unset, the tag is the literal string `direct`.
|
||||
- **FR-1.4** A listening socket with `device=X` but no `source=` MUST be tagged with the interface name `X`.
|
||||
- **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
|
||||
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
|
||||
rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
|
||||
userspace.
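
As an illustration of how FR-1.2 through FR-1.5 would resolve tags, the fragment below is a minimal sketch rather than a tested configuration. The interface names `gre-mg1` and `vlan100` and the tag `mg1` are assumptions chosen for the example.

```nginx
server {
    # FR-1.3: neither device= nor ipng_source_tag=, so the tag is the
    # ipng_stats_default_source value ("direct" unless overridden).
    listen 80;

    # FR-1.2: device-bound listener with an explicit tag "mg1".
    # "gre-mg1" is a hypothetical interface name.
    listen 80 device=gre-mg1 ipng_source_tag=mg1;

    # FR-1.4: device= without ipng_source_tag=, so the tag is "vlan100".
    # "vlan100" is a hypothetical interface name.
    listen 80 device=vlan100;

    server_name _;
}
```

All three listeners share port 80 and differ only in `device=`, which is the coexistence case FR-1.5 describes: the kernel, not the module, picks the most specific listener for each SYN.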
|
||||
@@ -122,14 +115,14 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
|
||||
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
|
||||
- `Accept: application/json` → JSON output.
|
||||
- `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
|
||||
- **FR-3.3** The handler MUST support a `source=<tag>` query parameter that filters the output to only counters whose source dimension
|
||||
- **FR-3.3** The handler MUST support a `source_tag=<tag>` query parameter that filters the output to only counters whose source dimension
|
||||
equals the supplied tag. The comparison is exact-match and case-sensitive.
|
||||
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
|
||||
equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
|
||||
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
|
||||
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST be versioned via a top-level `schema` field, so that
consumers can detect a breaking change instead of silently misparsing the output.
|
||||
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source`
|
||||
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with `source_tag`
|
||||
and `vip`. Counter metrics additionally carry a `code` label.
|
||||
|
||||
**FR-4 Hot path and flush**
|
||||
@@ -148,28 +141,60 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
|
||||
default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
|
||||
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
|
||||
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
|
||||
nor `source=`. Default `direct`.
|
||||
nor `ipng_source_tag=`. Default `direct`.
|
||||
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
|
||||
strictly increasing positive integers.
|
||||
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
|
||||
`http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module
|
||||
automatically sets `off` for the scrape location).
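
A hedged sketch of how the FR-5 directives might sit together in one configuration. The bucket boundaries and the `/healthz` location are illustrative assumptions, and `ipng_stats_zone` is left out because its argument syntax is described separately in FR-5.1.

```nginx
http {
    # ipng_stats_zone (FR-5.1) is required but omitted from this sketch.

    ipng_stats_flush_interval 1s;          # FR-5.2: the documented default
    ipng_stats_default_source direct;      # FR-5.3: the documented default
    ipng_stats_buckets 1 5 10 50 100 500;  # FR-5.4: assumed values, in ms,
                                           # strictly increasing

    server {
        listen 80;

        # Hypothetical location excluded from counting (FR-5.5).
        location = /healthz {
            ipng_stats off;
            return 204;
        }
    }
}
```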
|
||||
|
||||
**FR-6 Packaging**
|
||||
**FR-6 Variables**
|
||||
|
||||
- **FR-6.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
|
||||
- **FR-6.1** The module MUST register an nginx variable `$ipng_source_tag` that resolves to the source tag of the listening socket that
|
||||
accepted the current connection. For device-bound listeners this is the `ipng_source_tag=` value (or the `device=` name if
|
||||
`ipng_source_tag=` was not set); for wildcard fallback listeners this is the value of `ipng_stats_default_source`. The variable is
|
||||
usable in `log_format`, `map`, `add_header`, `if`, and any other nginx context that accepts variables.
|
||||
- **FR-6.2** `$ipng_source_tag` MUST be available unconditionally when the module is loaded, even if `ipng_stats_zone` is not
|
||||
configured. It does not depend on the counter subsystem; it only depends on the listen-parameter parsing. Operators who need the VIP
|
||||
address should use nginx's built-in `$server_addr` variable.
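
The fragment below sketches how `$ipng_source_tag` might be used in the contexts FR-6.1 lists. The format name `ipng_tagged`, the derived variable `$ipng_is_direct`, and the header name `X-IPng-Source` are assumptions made for the example. Everything else is stock nginx syntax.

```nginx
http {
    # Hypothetical log format that records the tag next to the VIP.
    log_format ipng_tagged '$remote_addr $ipng_source_tag $server_addr '
                           '"$request" $status $body_bytes_sent';

    # Hypothetical map: 1 for requests on a fallback listener, assuming
    # ipng_stats_default_source is left at its default of "direct".
    map $ipng_source_tag $ipng_is_direct {
        default 0;
        direct  1;
    }

    server {
        listen 80;
        access_log /var/log/nginx/access.log ipng_tagged;
        add_header X-IPng-Source $ipng_source_tag;
    }
}
```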
|
||||
|
||||
**FR-7 Packaging**
|
||||
|
||||
- **FR-7.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
|
||||
headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
|
||||
nginx itself.
|
||||
- **FR-6.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
|
||||
- **FR-7.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
|
||||
convention used by existing third-party nginx modules packaged for Debian.
|
||||
- **FR-6.3** The package MUST install:
|
||||
- **FR-7.3** The package MUST install:
|
||||
- `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
|
||||
- `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
|
||||
- A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
|
||||
package's postinst.
|
||||
- **FR-6.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
|
||||
- **FR-7.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
|
||||
`modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.
|
||||
|
||||
**FR-8 Logtail**
|
||||
|
||||
- **FR-8.1** The module MUST support an `ipng_stats_logtail <format_name> udp://host:port [buffer=<size>] [flush=<duration>]` directive
|
||||
at the `http` level that registers a global log-phase writer which fires unconditionally for every request, regardless of which
|
||||
`server` or `location` block handled it. One directive at the `http` level is sufficient to cover all vhosts — operators MUST NOT be
|
||||
required to repeat an `access_log` directive in every `server` block to achieve a single global access log.
|
||||
- **FR-8.2** The `<format_name>` argument MUST be the name of an existing nginx `log_format` defined in the same `http` block before
|
||||
this directive. The module MUST look up the compiled log format from nginx's log module at configuration time and use it to render each
|
||||
log line at request time. The module MUST NOT define its own format language; all `$variable` expansion is handled by nginx's standard
|
||||
log-format machinery, including `$ipng_source_tag` and `$server_addr`.
|
||||
- **FR-8.3** Each worker MUST buffer log lines in a per-worker memory buffer before transmitting them as UDP datagrams. The buffer size
|
||||
is controlled by the optional `buffer=<size>` parameter (default `64k`, minimum `1k`). The buffer MUST be flushed when it is full,
|
||||
when the optional `flush=<duration>` timer fires (default `1s`, minimum `100ms`), or when the worker exits. This ensures that a
|
||||
graceful `nginx -s reload` or a clean worker shutdown transmits all buffered log entries.
|
||||
- **FR-8.4** The destination argument of `ipng_stats_logtail` MUST be a `udp://host:port` URI, where `host` is a literal IPv4 address
|
||||
(no hostnames, no IPv6 in v0.1). Each buffer flush is transmitted as a single `sendto()` call on a per-worker `SOCK_DGRAM` socket
|
||||
opened at worker init and closed at worker exit. If no receiver is listening on the target address and port, the kernel silently
|
||||
discards the datagram — no error is returned, no disk I/O occurs, and the worker is never blocked. Lost datagrams when no receiver is
|
||||
present are intentional; the UDP transport is designed for fire-and-forget analytics pipelines where delivery guarantees are
|
||||
unnecessary and zero disk I/O is preferred over persistence. File-based access logging is not supported by this directive — operators
|
||||
should use nginx's built-in `access_log` for that purpose.
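
A minimal sketch of a logtail configuration under FR-8.1 through FR-8.4. The format name `ipng_tail`, the receiver address `192.0.2.10:5514` (a documentation-range placeholder), and the `buffer=` and `flush=` values are assumptions, chosen to sit within the stated minimums.

```nginx
http {
    # FR-8.2: the log_format must be defined before ipng_stats_logtail
    # refers to it. "ipng_tail" is a hypothetical name.
    log_format ipng_tail '$ipng_source_tag $server_addr $status $request_time';

    # FR-8.1, FR-8.3, FR-8.4: one http-level directive covers all vhosts.
    # The destination must be a literal IPv4 address and port.
    ipng_stats_logtail ipng_tail udp://192.0.2.10:5514 buffer=8k flush=500ms;
}
```

Because FR-8.4 sends each buffer flush as a single datagram, keeping `buffer=` comfortably below the maximum UDP payload size seems prudent, although the document does not prescribe a value.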
|
||||
|
||||
### Non-Functional Requirements
|
||||
|
||||
**NFR-1 Correctness under concurrency**
|
||||
@@ -242,7 +267,7 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
|
||||
|
||||
- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
|
||||
module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
|
||||
flowing, and integrating the scrape endpoint with both `maglevd-frontend` and a standalone Prometheus scraper. The user guide is the
|
||||
flowing, and integrating the scrape endpoint with Prometheus and other consumers. The user guide is the
|
||||
document an operator reads once to get from a freshly-installed package to a working, observable deployment.
|
||||
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
|
||||
module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
|
||||
@@ -265,26 +290,31 @@ There is no daemon, no socket the module listens on, no control plane. Everythin
|
||||
|
||||
Requests enter nginx through one of two listener classes:
|
||||
|
||||
1. **Device-bound listeners** (`listen ... device=X source=Y`) accept only connections whose ingress interface is `X`. Each is tagged
|
||||
1. **Device-bound listeners** (`listen ... device=X ipng_source_tag=Y`) accept only connections whose ingress interface is `X`. Each is tagged
|
||||
with a source string `Y`.
|
||||
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They
|
||||
are tagged with the configured default source (FR-1.3).
|
||||
|
||||
During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
|
||||
phase, the module's log-phase handler increments the worker-local counter table keyed by `(source, vip, status_code)`.
|
||||
phase, the module's log-phase handler performs two independent tasks:
|
||||
|
||||
1. **Counter update** — increments the worker-local counter table keyed by `(source, vip, status_code)`.
|
||||
2. **Logtail write** — if `ipng_stats_logtail` is configured (FR-8), renders the named `log_format` for this request and appends the
|
||||
resulting line to the per-worker write buffer. The buffer is flushed as a UDP datagram on a timer, when full, or on worker exit
|
||||
(FR-8.3, FR-8.4). This runs for every request regardless of `server` or `location` context.
|
||||
|
||||
A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
|
||||
deltas to the shared-memory zone via atomic adds.
|
||||
deltas to the shared-memory zone via atomic adds. The same timer triggers a logtail buffer flush if the flush duration has elapsed (FR-8.3).
|
||||
|
||||
The scrape handler, when invoked at `GET /ipng-stats` (or whatever location the operator chose), reads the shared-memory zone directly
|
||||
The scrape handler, when invoked at `GET /.well-known/ipng/statsz` (or whatever location the operator chose), reads the shared-memory zone directly
|
||||
and formats the output per the requested content type.
|
||||
|
||||
`maglevd-frontend` fetches the scrape endpoint of each backend in its configured fleet at roughly the same cadence it already uses for
|
||||
maglevd state. It filters server-side via `?source=<its own tag>` so that it only sees the traffic it delivered. The aggregated view is
|
||||
rendered alongside the existing maglev status page.
|
||||
Scrape consumers fetch the endpoint at their configured cadence, optionally filtering via `?source_tag=<tag>` so that each consumer sees
only the counters for the source tag it is interested in.
|
||||
|
||||
No component in this project writes to anything outside nginx's own memory. In particular, the module does not touch the file system,
|
||||
does not emit log lines on the request path, and does not speak to any upstream.
|
||||
Aside from the logtail output (FR-8), which sends UDP datagrams to a configured receiver, no component in this project writes to anything
outside nginx's own memory. The counter subsystem does not emit log lines on the request path, and the module does not speak to any
upstream.
|
||||
|
||||
## Components
|
||||
|
||||
@@ -295,7 +325,7 @@ dynamic-module ABI.
|
||||
|
||||
#### Responsibilities
|
||||
|
||||
- Parse new `listen` parameters `device=` and `source=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
|
||||
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
|
||||
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
|
||||
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_code)` (FR-2.1, NFR-1.1).
|
||||
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
|
||||
@@ -305,22 +335,22 @@ dynamic-module ABI.
|
||||
|
||||
#### Attribution Model
|
||||
|
||||
The module's single novel idea is that per-maglev attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
|
||||
inspection. Each `maglevd` instance terminates its GRE tunnel on a dedicated interface on the nginx host; the operator writes one
|
||||
`listen ... device=<ifname> source=<tag>` line per `(family, tunnel)` pair. The kernel binds that listening socket with `SO_BINDTODEVICE`,
|
||||
which causes it to match only connections whose ingress interface is that tunnel. A wildcard `listen 80;` and `listen [::]:80;` pair
|
||||
provides the fallback for traffic arriving on any other interface — typically normal web traffic, not from maglev.
|
||||
The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
|
||||
inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator
|
||||
writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The kernel binds that listening socket
|
||||
with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and
|
||||
`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic.
|
||||
|
||||
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
|
||||
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
|
||||
|
||||
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each tunnel.
|
||||
Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each interface.
|
||||
Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a
|
||||
new maglev instance (a new GRE tunnel) does. This is the correct split: VIPs are vhost-level concerns and change often; maglev instances
|
||||
are fleet-level concerns and change rarely.
|
||||
new attributed interface does. This is the correct split: VIPs are vhost-level concerns and change often; interfaces are
|
||||
infrastructure-level concerns and change rarely.
|
||||
|
||||
The design assumes GRE tunnels used as `device=` sources carry **only** maglev-originated traffic. Any other traffic arriving on such an
|
||||
interface is silently misattributed to that maglev's source tag. This is a deployment invariant, not a defect.
|
||||
The design assumes interfaces used as `device=` sources carry **only** traffic from the expected source. Any other traffic arriving on
|
||||
such an interface is silently misattributed to that interface's source tag. This is a deployment invariant, not a defect.
|
||||
|
||||
#### Counter Data Model
|
||||
|
||||
@@ -344,7 +374,7 @@ endpoint can recover the original strings without re-parsing configuration.
|
||||
|
||||
String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
|
||||
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
|
||||
low hundreds), and the number of distinct source tags is the number of maglev instances (single digits). The dominant factor is
|
||||
low hundreds), and the number of distinct source tags is the number of attributed interfaces (single digits). The dominant factor is
|
||||
`status_code`; ~60 keys per VIP is a typical steady state.
|
||||
|
||||
#### Hot Path
|
||||
@@ -408,7 +438,7 @@ The worker never walks the entire table — only dirty slots — so idle VIPs co
|
||||
|
||||
The `ipng_stats` handler is a leaf content handler. It:
|
||||
|
||||
1. Parses `?source=` and `?vip=` into exact-match filters.
|
||||
1. Parses `?source_tag=` and `?vip=` into exact-match filters.
|
||||
2. Parses `Accept:` to pick output format.
|
||||
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
|
||||
briefly).
|
||||
@@ -423,10 +453,10 @@ fixed-size buffer per chain link and requests new links only when full.
|
||||
|
||||
- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
|
||||
query parameters.
|
||||
- **Two new `listen` parameters**, `device=` and `source=`, usable anywhere a `listen` directive is used.
|
||||
- **Two new `listen` parameters**, `device=` and `ipng_source_tag=`, usable anywhere a `listen` directive is used.
|
||||
- **Five new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
|
||||
`ipng_stats_buckets`, `ipng_stats` (on/off).
|
||||
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source`, `vip`, and (for request counters) `code`.
|
||||
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source_tag`, `vip`, and (for request counters) `code`.
|
||||
|
||||
**Consumes.**
|
||||
|
||||
@@ -442,10 +472,10 @@ Debian is the target and upstream nginx on Debian is the platform.
|
||||
#### Responsibilities
|
||||
|
||||
- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
|
||||
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-6.3).
|
||||
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-7.3).
|
||||
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
|
||||
(FR-6.3).
|
||||
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-6.4).
|
||||
(FR-7.3).
|
||||
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-7.4).
|
||||
|
||||
#### Build
|
||||
|
||||
@@ -476,12 +506,13 @@ No nginx binary is produced, shipped, or touched. The package is strictly additi

 A typical deployment on a single nginx host looks like:

-- One GRE tunnel per maglev instance, terminated on the nginx host by the operator's networking layer (systemd-networkd, Netplan, or a
-hand-rolled interface config). Interface names follow a consistent pattern, typically `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
-- VIPs bound to a local dummy or loopback interface so the kernel accepts inner packets destined for them.
-- A hand-maintained `listen` include file with one device-bound listen per `(family, tunnel)` pair, reused across vhosts.
+- One interface per traffic source that should be separately attributed (e.g. GRE tunnels, VLANs), set up by the operator's networking
+layer (systemd-networkd, Netplan, or a hand-rolled interface config). Interface names follow a consistent pattern, typically
+`gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
+- VIPs bound to a local dummy or loopback interface so the kernel accepts packets destined for them.
+- A hand-maintained `listen` include file with one device-bound listen per `(family, interface)` pair, reused across vhosts.
 - Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
-- A single scrape location, e.g. `location = /ipng-stats`, served from a locked-down server block that only allows the maglev fleet and
+- A single scrape location, e.g. `location = /.well-known/ipng/statsz`, served from a locked-down server block that only allows scrape consumers and
 the local Prometheus scraper.

 ### Configuration
@@ -497,7 +528,7 @@ http {
     server {
         listen 80;
         listen [::]:80;
-        include /etc/nginx/ipng-maglev/listens.conf;
+        include /etc/nginx/ipng-stats/listens.conf;

         server_name _;
         # ... normal vhost content
@@ -505,17 +536,17 @@ http {

     server {
         listen 127.0.0.1:9113;
-        location = /ipng-stats {
+        location = /.well-known/ipng/statsz {
             ipng_stats;
             allow 127.0.0.1;
-            allow 2001:db8::/48;  # maglev fleet
+            allow 2001:db8::/48;  # scrape consumers
             deny all;
         }
     }
 }
 ```

-`listens.conf` is eight lines (two families × four maglevs) and stable across vhost changes.
+`listens.conf` is two lines per attributed interface (two address families each) and stable across vhost changes.
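
To make the "two lines per attributed interface" shape of `listens.conf` concrete, here is a minimal sketch that renders those lines from a list of `(device, tag)` pairs. It assumes the `listen 80 device=gre-mg1 ipng_source_tag=mg1;` form quoted elsewhere in this document and assumes the IPv6 line takes the same parameters; the function name `render_listens` and the example interfaces are illustrative only, and the design itself describes the file as hand-maintained.

```python
from typing import Iterable, List, Tuple


def render_listens(sources: Iterable[Tuple[str, str]], port: int = 80) -> List[str]:
    """Return one IPv4 and one IPv6 wildcard listen line per (device, tag) pair.

    Assumption: the IPv6 listen accepts the same device= and
    ipng_source_tag= parameters as the IPv4 form shown in the design doc.
    """
    lines: List[str] = []
    for device, tag in sources:
        suffix = f"device={device} ipng_source_tag={tag};"
        lines.append(f"listen {port} {suffix}")       # IPv4 wildcard
        lines.append(f"listen [::]:{port} {suffix}")  # IPv6 wildcard
    return lines


if __name__ == "__main__":
    # Hypothetical interfaces; substitute the host's real device names.
    for line in render_listens([("gre-mg1", "mg1"), ("gre-mg2", "mg2")]):
        print(line)
```

Because the lines carry only `(port, device)` and no VIP address, the output grows with the number of interfaces and not with the number of VIPs.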

 ### Nginx Reload Semantics

@@ -550,15 +581,15 @@ some other endpoint.
 - **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
 recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
 restarts.
-- **Device disappears.** If an operator removes a GRE tunnel without removing its `listen` line, nginx's bind will fail on the next
+- **Device disappears.** If an operator removes an interface without removing its `listen` line, nginx's bind will fail on the next
 reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
 - **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
-default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, a maglev instance
-is probably missing from the `listen` include.
+default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, an interface is
+probably missing from the `listen` include.
 - **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
 operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
-- **Maglev frontend is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
-When the frontend comes back, it resumes fetching. No state is lost.
+- **Scrape consumer is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
+When the consumer comes back, it resumes fetching. No state is lost.
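
The "traffic under `direct`" check in the list above can be automated against the Prometheus text output. The sketch below is illustrative only: the metric name `ipng_requests_total` and the label name `source_tag` are assumptions made for this example and are not confirmed by this document, and the parser is deliberately simplified (it does not handle label values that contain `}`).

```python
import re
from typing import Dict

# Hypothetical names: the module's real metric and label names may differ.
METRIC = "ipng_requests_total"
LABEL = "source_tag"

_SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)')
_LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def requests_by_tag(text: str) -> Dict[str, float]:
    """Sum METRIC samples per LABEL value from Prometheus text exposition."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        match = _SAMPLE.match(line)
        if match is None or match.group(1) != METRIC:
            continue  # comment lines, blank lines, and other metrics
        labels = dict(_LABEL_PAIR.findall(match.group(2)))
        tag = labels.get(LABEL)
        if tag is not None:
            totals[tag] = totals.get(tag, 0.0) + float(match.group(3))
    return totals


def unexpected_direct(text: str, default_tag: str = "direct") -> bool:
    """True when the default tag has counted any traffic at all."""
    return requests_by_tag(text).get(default_tag, 0.0) > 0
```

An operator who expects every request to arrive on a device-bound listener could alert whenever `unexpected_direct` returns true for a scrape.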

 ### Security

@@ -586,18 +617,16 @@ some other endpoint.
 decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
 accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
 heavy for a log-phase handler.
-- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-maglev
+- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source
 tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
 kernel primitives nginx already knows about.
 - **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
 socket-lookup specificity.
-- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1;`. The wildcard form works
+- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works
 because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
 address. This makes the generated include file size independent of the VIP count.
-- **Pushing counters from the module into `maglevd` over gRPC.** Rejected. It inverts the wait-for graph (maglevd's design doc is
-careful to keep the daemon free of callbacks from the backends), it complicates restart neutrality, and it adds a gRPC client to a C
-module. Pull-based scrape keeps maglevd out of the traffic-metrics business, matches the doc's philosophy, and lets the frontend use
-its existing per-server goroutine model.
+- **Pushing counters to an external daemon over gRPC.** Rejected. It complicates restart neutrality and adds a gRPC client dependency to
+a C module. Pull-based scrape is simpler: consumers fetch when they want, and the module has no outbound connections.
 - **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
 both audiences from one ACL.

@@ -609,5 +638,5 @@ some other endpoint.
 - **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For
 TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until
 operators ask for it.
-- **`maglevd-frontend` fetch cadence.** Whichever cadence the frontend adopts for traffic counters — the existing ~one-second refresh,
-or an SSE bridge layered on top — the plugin supports it. The choice is on the frontend side.
+- **Consumer fetch cadence.** Whichever cadence a consumer adopts for traffic counters — a one-second refresh, a longer Prometheus
+scrape interval, or an SSE bridge layered on top — the plugin supports it. The choice is on the consumer side.
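
Whatever cadence a consumer picks, it typically turns two successive counter samples into a rate. The sketch below shows one way a consumer might do that while tolerating the counter reset that follows an nginx master restart (the document states that counters start from zero when the zone is recreated). The function name and the reset heuristic are assumptions for illustration, not part of the module.

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second rate between two samples of a monotonically increasing counter.

    If the counter went backwards, assume the shared zone was recreated and
    count only what has accrued since the reset (the heuristic commonly used
    for Prometheus-style counters).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr if curr < prev else curr - prev
    return delta / interval_s
```

With this shape, a one-second refresh and a sixty-second scrape interval differ only in the `interval_s` passed in; the reset handling is the same at any cadence.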