<!-- SPDX-License-Identifier: Apache-2.0 -->
# nginx-ipng-stats-plugin Design Document

## Metadata

| | |
| --- | --- |
| **Status** | Draft — describes intended behavior for `v0.2.0` |
| **Author** | Pim van Pelt `<pim@ipng.ch>` |
| **Last updated** | 2026-04-16 |
| **Audience** | Operators and contributors deploying per-device, per-VIP traffic observability on nginx |

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, and **MAY** are used as described in
[RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119), and are reserved in this document for requirements that are intended to be
enforced in code or by an external dependency. Plain-language descriptions of what the system or an operator can do are written in
lowercase — "can", "will", "does" — and should not be read as normative.

## Summary

`nginx-ipng-stats-plugin` is a dynamic nginx module and its surrounding Debian packaging. Loaded into stock upstream nginx, the module
records per-VIP traffic counters — requests, status codes, bytes, latency — and attributes them to the specific interface on which each
connection arrived. A small HTTP scrape endpoint exposes the counters as both Prometheus text and JSON so that Prometheus, custom
dashboards, and ad-hoc `curl` sessions can all read the same data.

## Background

Any deployment where traffic arrives on distinct Linux interfaces — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet — can
benefit from per-interface traffic visibility. The nginx instances that serve the traffic already observe everything an operator wants to
see — they are the authoritative source for request rate, response code mix, bytes moved, and latency distributions. A small in-process
module emits those numbers on an HTTP endpoint, and consumers scrape the data filtered by source tag.

One motivating use case is [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), where each load-balancer instance terminates a GRE
tunnel on the nginx host. The module attributes traffic per tunnel, letting the frontend show per-backend counters that VPP's fast path
cannot provide. But the module is not coupled to that use case — it works with any interface type and any consumer.

## Goals and Non-Goals

### Product Goals

1. **Per-VIP, per-device traffic visibility.** For each VIP, the module records request count, status-code distribution, bytes in and
   out, and request-duration histograms, split by which interface delivered the traffic.
2. **Negligible hot-path cost.** At steady state, a request traversing an nginx worker with the module loaded pays at most a handful of
   non-atomic integer increments and a histogram bucket update. No locks, no allocations, no system calls.
3. **Two readers, one endpoint.** A single HTTP location serves both Prometheus text and JSON, so a site running Prometheus and a site
   using a custom consumer can both consume the module without extra configuration.
4. **Packaging as a dynamic module.** The module builds with nginx's `--with-compat` ABI and ships as a Debian package that loads into
   stock upstream nginx without recompiling nginx itself.
5. **Composable with normal nginx use.** A host running the module with device-bound listeners **and** serving unrelated direct web
   traffic on the same ports MUST remain a correct nginx deployment. The module MUST NOT change the semantics of any existing directive;
   it only adds new parameters and directives that are no-ops when unused.
6. **Graceful reload.** An `nginx -s reload` MUST NOT reset counters, lose history, or drop in-flight connections from the module's point
   of view.

### Non-Goals

- The module is **not** a generic nginx metrics exporter. It does not aim to replace `nginx-module-vts`, `ngx_http_stub_status`, or
  `nginx-lua-prometheus`. Its metric set is deliberately narrow: per-VIP, per-device counters, histograms, and rate gauges.
- The module does **not** terminate TLS, rewrite headers, or alter the request in any way. It is observation-only.
- The module does **not** talk to any external daemon. It does not initiate gRPC or read any external configuration. The attribution tag
  it emits is a string supplied by the operator in the `listen` directive; nothing more.
- The module does **not** provide per-client-IP, per-path, or per-User-Agent counters. Those dimensions explode cardinality and belong in
  access logs and existing log-analysis tools.
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
  restart they start at zero. Consumers who need historical retention SHOULD get it from a Prometheus server that scrapes the endpoint.
- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges
  are the operator's responsibility.

## Requirements

Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that later sections can cite it.

### Functional Requirements

**FR-1 Attribution**

- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which records a mapping
  between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the
  kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module
  MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev
  deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without
  `device=` MUST create a plain listening socket as stock nginx does.
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
  the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
  `ipng_stats_default_source`, FR-5.3). The built-in default is the literal string `direct`.
- **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
- **FR-1.5** Two or more `listen` directives sharing `address:port` MUST coexist regardless of whether each carries `device=`.
  Since no `SO_BINDTODEVICE` is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the
  module distinguishes them by reading `ifindex` from the per-connection cmsg. The listen wrapper therefore deduplicates on
  `(server block, sockaddr)` across both plain and device-tagged listens: the first occurrence registers the kernel socket,
  and subsequent same-sockaddr siblings (plain or device-tagged) are accepted without tripping nginx's duplicate-listen check.
  Device-tagged siblings additionally register an entry in the attribution table.
- **FR-1.5a** When the *same* sockaddr appears in a `listen` directive of a *different* `server { ... }` block — the shared `include` pattern —
  the wrapper MUST strip socket-level options from `cf->args` before delegating to the core listen handler. These options
  (`reuseport`, `bind`, `backlog=`, `rcvbuf=`, `sndbuf=`, `setfib=`, `fastopen=`, `accept_filter=`, `deferred`, `ipv6only=`,
  `so_keepalive=`) apply to the one kernel socket that backs the sockaddr and nginx rejects any attempt to set them more
  than once per sockaddr with `duplicate listen options for <addr>` (see `ngx_http_add_addresses` in `src/http/ngx_http.c`).
  The first cscf to hit the sockaddr owns the options; subsequent cscfs go through the core handler with the options removed and
  merge via `ngx_http_add_server`. Protocol-level flags (`ssl`, `http2`, `quic`, `proxy_protocol`) are preserved on every call
  because nginx merges them with OR semantics across cscfs.
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST attribute
  every connection whose ingress interface is `<ifname>` — regardless of which local address the client addressed — to that
  listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3).

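The tag precedence in FR-1.2 through FR-1.4 can be sketched as a small pure function. This is an illustration only: `listen_params_t` and `resolve_source_tag` are hypothetical names, not the module's actual types, and the sketch assumes the `listen` parameters have already been parsed into plain C strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the listen parameters that matter for tagging.
 * NULL means the parameter was not given on the listen line. */
typedef struct {
    const char *device;      /* value of device=<ifname>, or NULL */
    const char *source_tag;  /* value of ipng_source_tag=<tag>, or NULL */
} listen_params_t;

/* Resolve the source tag for one listener:
 *   an explicit ipng_source_tag= wins (FR-1.2),
 *   otherwise the device name is used (FR-1.4),
 *   otherwise the configured default source applies (FR-1.3). */
static const char *
resolve_source_tag(const listen_params_t *lp, const char *default_source)
{
    assert(lp != NULL && default_source != NULL);

    if (lp->source_tag != NULL && lp->source_tag[0] != '\0') {
        return lp->source_tag;
    }
    if (lp->device != NULL && lp->device[0] != '\0') {
        return lp->device;
    }
    return default_source;
}
```

With a default source of `direct`, a listener carrying only `device=gre0` (a made-up interface name) resolves to `gre0`, and a listener with neither parameter resolves to `direct`.
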
**FR-2 Counters**

- **FR-2.1** The module MUST maintain, for every observed `(source, vip, status_class)` tuple, the following counters: total requests,
  total bytes received (sum of request bytes including request line, headers, and body), total bytes sent (sum of response bytes
  including status line, headers, and body), and sum of request durations in milliseconds (exported as `nginx_ipng_latency_total`).
  The module MUST additionally maintain, per `(source, vip)` pair (no `code` label), fixed-bucket histograms of request duration in
  milliseconds and of request/response sizes in bytes.
- **FR-2.2** When an upstream is used to serve the request, the module MUST additionally maintain a fixed-bucket histogram of upstream
  response time in milliseconds, keyed by the same `(source, vip)` pair.
- **FR-2.3** The duration histogram bucket boundaries MUST be fixed at module initialization and MUST be the same for every `(source,
  vip)` key. The default boundaries are `{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}` milliseconds plus an implicit
  `+Inf` bucket. Operators MAY override the boundaries via the `ipng_stats_buckets` directive at the `http` level. The byte-size
  histograms (request and response bodies) use independent bounds defaulting to `{100, 1000, 10000, 100000, 1000000, 10000000}` bytes;
  `ipng_stats_byte_buckets` overrides them.
- **FR-2.4** The module MUST additionally maintain, per `(source, vip)` pair, exponentially-weighted moving averages for instantaneous
  request rate with decay windows of 1 second, 10 seconds, and 60 seconds. EWMAs are updated from the periodic flush tick (see FR-4.2),
  not from the request path.
- **FR-2.5** The `vip` dimension of every counter MUST be the connection's `$server_addr` in its canonical textual form (dotted-quad for
  IPv4, RFC 5952 lowercase-compressed form for IPv6). IPv6 zone identifiers (scope-ids), if any, MUST be stripped during canonicalization;
  link-local VIPs (which are not expected in practice) are attributed under their scope-less textual form. Port is not part of the key;
  a VIP that listens on both 80 and 443 MUST be aggregated.
- **FR-2.6** The `status_code` dimension MUST be bucketed into a single class label: `1xx`, `2xx`, `3xx`, `4xx`, `5xx`, or `unknown` for
  codes outside `[100, 599]`. This bounds per-`(source, vip)` cardinality to six lanes regardless of how many distinct three-digit
  codes are observed. Operators who need a full per-code breakdown SHOULD enable `ipng_stats_logtail` (FR-8) and derive the per-code
  view from the access-log stream off the hot path; the stats zone intentionally trades that resolution for a much smaller scrape
  response.
- **FR-2.7** The module MUST additionally maintain, per `(source, vip)` pair, three point-in-time gauges of requests currently in
  flight: `active` (observed at `POST_READ` but not yet finalized), `reading` (in the pre-response phases — rewrite/access/content),
  and `writing` (past the header-send transition). The invariant `reading + writing = active` MUST hold at any instant. Subrequests
  and internal redirects MUST NOT double-count the parent request. Updates to the gauges are atomic increments/decrements on the
  request lifecycle hooks — no slab lock after the first time a `(source, vip)` pair is seen — so the hot-path rule in FR-4.1 still
  holds for ordinary counter updates while gauges are maintained lock-free.

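The status-class bucketing of FR-2.6 reduces to a constant-time index computation. A minimal sketch, assuming the six lanes are stored in a small array indexed by class; the enum and function names are hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Six lanes per (source, vip) pair, as described in FR-2.6. */
enum {
    CLASS_1XX, CLASS_2XX, CLASS_3XX, CLASS_4XX, CLASS_5XX,
    CLASS_UNKNOWN,
    CLASS_COUNT
};

/* Map a three-digit HTTP status to its class lane in constant time.
 * Anything outside [100, 599] lands in the "unknown" lane. */
static int
status_class(int status)
{
    if (status < 100 || status > 599) {
        return CLASS_UNKNOWN;
    }
    return status / 100 - 1;   /* 100..199 -> 0, ..., 500..599 -> 4 */
}

/* Label value used for the `code` dimension of a lane. */
static const char *
status_class_label(int cls)
{
    static const char *labels[CLASS_COUNT] = {
        "1xx", "2xx", "3xx", "4xx", "5xx", "unknown"
    };
    assert(cls >= 0 && cls < CLASS_COUNT);
    return labels[cls];
}
```

Because the lane index is plain integer arithmetic, the per-request cost does not depend on how many distinct status codes a VIP emits.
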
**FR-3 Scrape endpoint**

- **FR-3.1** The module MUST provide a new nginx handler directive, `ipng_stats;`, that, when placed in a `location` block, causes that
  location to serve the module's counters and MUST NOT be combinable with other content handlers in the same location.
- **FR-3.2** The `ipng_stats` handler MUST support content negotiation via the `Accept` request header:
  - `Accept: application/json` → JSON output.
  - `Accept: text/plain` (or anything else, including absent) → Prometheus text exposition format.
- **FR-3.3** The handler MUST support a `source_tag=<tag>` query parameter that filters the output to only counters whose source dimension
  equals the supplied tag. The comparison is exact-match and case-sensitive.
- **FR-3.4** The handler MUST support a `vip=<address>` query parameter that filters the output to only counters whose VIP dimension
  equals the supplied address. The comparison uses the canonicalized form of FR-2.5.
- **FR-3.5** Both filters MAY be supplied together; their effect is the intersection.
- **FR-3.6** The JSON schema MUST be documented in `docs/scrape-api.md` and MUST be versioned via a top-level `schema` field so that
  schema changes can be introduced without breaking existing consumers.
- **FR-3.7** The Prometheus text output MUST use stable metric names prefixed with `nginx_ipng_` and MUST label every series with
  `source_tag` and `vip`. Counter metrics (`nginx_ipng_requests_total`, `nginx_ipng_bytes_{in,out}_total`, `nginx_ipng_latency_total`)
  additionally carry a `code` label with a class value (`1xx`..`5xx`/`unknown`). Histogram series (duration, upstream response,
  request/response byte size) MUST NOT carry a `code` label — they aggregate across all classes for a given `(source, vip)` pair.
  Gauge series (`nginx_ipng_active`, `nginx_ipng_reading`, `nginx_ipng_writing`) MUST be labelled with `source_tag` and `vip` only
  (no `code`) and MUST be typed as `gauge` in the exposition preamble.

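To make the labelling rules of FR-3.7 concrete, the sketch below renders one counter sample and one gauge sample in Prometheus text exposition format. The function names are hypothetical, the tag and VIP values in the usage note are made up, and the sketch assumes label values contain no characters that need escaping (backslash, double quote, newline).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render one counter sample with source_tag, vip and code labels.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the buffer is too small. */
static int
format_counter_line(char *buf, size_t cap, const char *metric,
                    const char *source_tag, const char *vip,
                    const char *code, unsigned long long value)
{
    int n = snprintf(buf, cap,
                     "%s{source_tag=\"%s\",vip=\"%s\",code=\"%s\"} %llu\n",
                     metric, source_tag, vip, code, value);
    return (n < 0 || (size_t) n >= cap) ? -1 : n;
}

/* Render one gauge sample: same labels, but no code label. */
static int
format_gauge_line(char *buf, size_t cap, const char *metric,
                  const char *source_tag, const char *vip,
                  unsigned long long value)
{
    int n = snprintf(buf, cap, "%s{source_tag=\"%s\",vip=\"%s\"} %llu\n",
                     metric, source_tag, vip, value);
    return (n < 0 || (size_t) n >= cap) ? -1 : n;
}
```

For example, `format_counter_line(buf, sizeof(buf), "nginx_ipng_requests_total", "mg1", "192.0.2.1", "2xx", 42)` produces the line `nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.1",code="2xx"} 42`.
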
**FR-4 Hot path and flush**

- **FR-4.1** Per-request counter updates MUST occur in the nginx log phase and MUST be localized to the current worker's private counter
  table. The module MUST NOT take any locks on the request path and MUST NOT issue any atomic operation on the request path.
- **FR-4.2** Each worker MUST run a periodic timer, default one second, that flushes the worker's private counter deltas into the
  shared-memory zone using atomic adds. The flush interval is configurable via the `ipng_stats_flush_interval` directive.
- **FR-4.3** The scrape handler MUST read only from the shared-memory zone. Workers MUST NOT read from each other's private tables.
- **FR-4.4** Histogram updates MUST be branch-light: the module MUST precompute a small lookup that maps elapsed milliseconds to a bucket
  index using binary search over the fixed boundary array, and MUST NOT scan the array linearly.

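The binary search required by FR-4.4 can be sketched as a lower-bound lookup over the boundary array. The boundaries are the defaults from FR-2.3; the function name is hypothetical, and treating each boundary as an inclusive upper bound (Prometheus `le` semantics) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default duration boundaries in milliseconds (FR-2.3). */
static const uint64_t default_bounds_ms[] = {
    1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000
};
#define N_BOUNDS (sizeof(default_bounds_ms) / sizeof(default_bounds_ms[0]))

/* Return the index of the first boundary that is >= value.
 * A return value of n selects the implicit +Inf bucket. */
static size_t
bucket_index(const uint64_t *bounds, size_t n, uint64_t value)
{
    size_t lo = 0, hi = n;               /* search window is [lo, hi) */

    assert(bounds != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] < value) {
            lo = mid + 1;                /* boundary too small: go right */
        } else {
            hi = mid;                    /* boundary fits: keep looking left */
        }
    }
    return lo;
}
```

With twelve boundaries the loop runs at most four iterations per request, which matches the `O(log B)` bound stated in NFR-2.1.
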
**FR-5 Directives**

- **FR-5.1** `ipng_stats_zone name:size` at the `http` level declares the shared-memory zone the module uses. `name` is the zone name (no
  default); `size` is a size with suffix (`k`, `m`). The directive is mandatory if the module is loaded.
- **FR-5.2** `ipng_stats_flush_interval <duration>` at the `http` level sets the worker flush cadence. Default `1s`. Minimum `100ms`.
- **FR-5.3** `ipng_stats_default_source <tag>` at the `http` level sets the tag applied to listening sockets that have neither `device=`
  nor `ipng_source_tag=`. Default `direct`.
- **FR-5.4** `ipng_stats_buckets <ms ms ms ...>` at the `http` level overrides the default histogram bucket boundaries. Values MUST be
  strictly increasing positive integers.
- **FR-5.5** `ipng_stats on|off` at the `http`, `server`, or `location` level opts a context into or out of counting. Default `on` at the
  `http` level when the module is loaded. A location serving the `ipng_stats` handler MUST NOT have itself counted (the module
  automatically sets `off` for the scrape location).

**FR-6 Variables**

- **FR-6.1** The module MUST register an nginx variable `$ipng_source_tag` that resolves to the source tag of the listening socket that
  accepted the current connection. For device-bound listeners this is the `ipng_source_tag=` value (or the `device=` name if
  `ipng_source_tag=` was not set); for wildcard fallback listeners this is the value of `ipng_stats_default_source`. The variable is
  usable in `log_format`, `map`, `add_header`, `if`, and any other nginx context that accepts variables.
- **FR-6.2** `$ipng_source_tag` MUST be available unconditionally when the module is loaded, even if `ipng_stats_zone` is not
  configured. It does not depend on the counter subsystem; it only depends on the listen-parameter parsing. Operators who need the VIP
  address should use nginx's built-in `$server_addr` variable.

**FR-7 Packaging**

- **FR-7.1** The module MUST build as a dynamic module using nginx's `--with-compat --add-dynamic-module=...` flow, against the nginx-dev
  headers of the target Debian release, so that the resulting `.so` loads into stock upstream nginx on that release without rebuilding
  nginx itself.
- **FR-7.2** The module MUST ship as a Debian package named `libnginx-mod-http-ipng-stats`, following the `libnginx-mod-http-*` naming
  convention used by existing third-party nginx modules packaged for Debian.
- **FR-7.3** The package MUST install:
  - `/usr/lib/nginx/modules/ngx_http_ipng_stats_module.so`
  - `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf` containing the `load_module` directive.
  - A symlink `/etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf → ../modules-available/50-mod-http-ipng-stats.conf` created in the
    package's postinst.
- **FR-7.4** The package postinst MUST run `nginx -t` after installing the module. If the test fails, postinst MUST remove the
  `modules-enabled` symlink and report a non-fatal warning so that a broken upgrade does not leave the operator's nginx unable to start.

**FR-8 Logtail**

- **FR-8.1** The module MUST support an `ipng_stats_logtail <format_name> udp://host:port [buffer=<size>] [flush=<duration>] [if=$var]`
  directive at the `http` level that registers a global log-phase writer which fires for every request (unless suppressed by `if=`),
  regardless of which `server` or `location` block handled it. One directive at the `http` level is sufficient to cover all vhosts —
  operators MUST NOT be required to repeat an `access_log` directive in every `server` block to achieve a single global access log.
- **FR-8.2** The `<format_name>` argument MUST be the name of an existing nginx `log_format` defined in the same `http` block before
  this directive. The module MUST look up the compiled log format from nginx's log module at configuration time and use it to render each
  log line at request time. The module MUST NOT define its own format language; all `$variable` expansion is handled by nginx's standard
  log-format machinery, including `$ipng_source_tag` and `$server_addr`.
- **FR-8.3** Each worker MUST buffer log lines in a per-worker memory buffer before transmitting them as UDP datagrams. The buffer size
  is controlled by the optional `buffer=<size>` parameter (default `64k`, minimum `1k`). The buffer MUST be flushed when it is full,
  when the optional `flush=<duration>` timer fires (default `1s`, minimum `100ms`), or when the worker exits. This ensures that a
  graceful `nginx -s reload` or a clean worker shutdown transmits all buffered log entries.
- **FR-8.4** The destination argument of `ipng_stats_logtail` MUST be a `udp://host:port` URI, where `host` is a literal IPv4 address
  (no hostnames, no IPv6 in v0.1). Each buffer flush is transmitted as a single `sendto()` call on a per-worker `SOCK_DGRAM` socket
  opened at worker init and closed at worker exit. If no receiver is listening on the target address and port, the kernel silently
  discards the datagram — no error is returned, no disk I/O occurs, and the worker is never blocked. Lost datagrams when no receiver is
  present are intentional; the UDP transport is designed for fire-and-forget analytics pipelines where delivery guarantees are
  unnecessary and zero disk I/O is preferred over persistence. File-based access logging is not supported by this directive — operators
  should use nginx's built-in `access_log` for that purpose.
- **FR-8.5** The directive MAY include an `if=$variable` parameter. When present, the logtail writer MUST evaluate the named nginx
  variable at log phase and MUST suppress the log line if the variable is not found, is empty, or equals the string `"0"`. The
  condition MUST be checked before the log format is rendered, so that filtered requests incur no formatting cost. This follows the same
  semantics as nginx's built-in `access_log ... if=` and is intended for suppressing high-frequency requests (e.g. health checks) from
  the logtail stream. Filtered requests MUST still be counted by the stats module — the `if=` condition affects only logtail output.

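The suppression rule of FR-8.5 is a three-way check that can run before any formatting work is done. A minimal sketch, using a hypothetical `var_value_t` that stands in for an evaluated nginx variable (a not-found flag, a length, and a pointer to bytes that are not NUL-terminated):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an evaluated nginx variable value. */
typedef struct {
    bool        not_found;  /* variable could not be evaluated */
    size_t      len;        /* number of valid bytes in data */
    const char *data;       /* not NUL-terminated */
} var_value_t;

/* FR-8.5: a line is logged unless the if= variable is not found,
 * is empty, or is exactly the one-character string "0". */
static bool
logtail_should_log(const var_value_t *v)
{
    assert(v != NULL);

    if (v->not_found || v->len == 0) {
        return false;
    }
    if (v->len == 1 && v->data[0] == '0') {
        return false;
    }
    return true;
}
```

Only the exact string `0` suppresses the line; a value such as `00` or `false` still logs, which mirrors how the built-in `access_log ... if=` condition behaves.
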
### Non-Functional Requirements

**NFR-1 Correctness under concurrency**

- **NFR-1.1** Per-worker counter tables MUST be owned exclusively by their worker and MUST NOT be read or written by any other worker,
  any handler, or any timer other than the worker's own flush timer.
- **NFR-1.2** Flushes from workers into the shared zone MUST use relaxed atomic `fetch_add` on 64-bit lanes. The module MUST NOT rely on
  `memset`, `memcpy`, or any unaligned access for shared-zone updates.
- **NFR-1.3** A scrape that races with a flush MUST observe a monotonically non-decreasing counter value; temporary readings that see
  partial flushes across different keys are acceptable, but a single counter MUST never appear to decrease.
- **NFR-1.4** Histogram bucket counts and sum/count fields MUST be updated in a way that a concurrent scrape never observes
  `count > sum-of-buckets`. This is achieved by updating bucket counts before the sum/count and by a scraper that reads sum/count before
  bucket counts.

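The write and read ordering described in NFR-1.4 can be sketched with C11 atomics. The struct and function names are hypothetical. The sketch uses relaxed operations because NFR-1.2 names them; relaxed ordering by itself does not constrain how updates to different lanes become visible on other CPUs, so a real implementation would likely pair the two groups with release/acquire ordering or fences. With the ordering shown, a snapshot's bucket total is never smaller than its count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 13   /* 12 default boundaries plus +Inf (FR-2.3) */

/* Hypothetical histogram slot in the shared-memory zone. */
typedef struct {
    _Atomic uint64_t bucket[N_BUCKETS];
    _Atomic uint64_t count;
    _Atomic uint64_t sum_ms;
} shared_hist_t;

/* Hypothetical worker-private deltas since the last flush. */
typedef struct {
    uint64_t bucket[N_BUCKETS];
    uint64_t count;
    uint64_t sum_ms;
} local_hist_t;

/* Writer side: bucket lanes first, then sum and count.
 * The local deltas are zeroed once they have been applied. */
static void
hist_flush(shared_hist_t *dst, local_hist_t *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < N_BUCKETS; i++) {
        if (src->bucket[i] != 0) {
            atomic_fetch_add_explicit(&dst->bucket[i], src->bucket[i],
                                      memory_order_relaxed);
            src->bucket[i] = 0;
        }
    }
    atomic_fetch_add_explicit(&dst->sum_ms, src->sum_ms, memory_order_relaxed);
    atomic_fetch_add_explicit(&dst->count, src->count, memory_order_relaxed);
    src->sum_ms = 0;
    src->count = 0;
}

/* Reader side: count first, then the bucket lanes.
 * Returns the count and stores the bucket total in *bucket_sum. */
static uint64_t
hist_snapshot(shared_hist_t *src, uint64_t *bucket_sum)
{
    uint64_t count = atomic_load_explicit(&src->count, memory_order_relaxed);
    uint64_t total = 0;

    for (size_t i = 0; i < N_BUCKETS; i++) {
        total += atomic_load_explicit(&src->bucket[i], memory_order_relaxed);
    }
    *bucket_sum = total;
    return count;
}
```
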
**NFR-2 Hot-path cost**

- **NFR-2.1** The per-request cost of the log-phase handler MUST be bounded by: one listening-socket pointer deref, one VIP pointer deref
  (cached on the connection struct), a constant-time status-code index computation, a constant number of integer increments, and an
  `O(log B)` histogram binary search where `B` is the number of buckets. No syscalls, no allocations, no locks.
- **NFR-2.2** The per-flush cost per worker MUST be bounded by `O(K)` atomic adds, where `K` is the number of distinct
  `(source, vip, class)` keys touched by that worker since the last flush. Keys untouched during an interval MUST NOT be visited.
- **NFR-2.3** The scrape cost MUST be bounded by `O(K_total)` reads from the shared zone plus `O(K_total)` string format operations,
  where `K_total` is the number of distinct keys in the zone.

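One way to meet the `O(K)` flush bound of NFR-2.2 is to keep a list of the keys touched since the last flush. The sketch below is an assumption about how that could look, not the module's actual layout: keys are small integer slot indices, `MAX_KEYS` is an arbitrary capacity, and only a single request counter is tracked per key.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 64   /* hypothetical table capacity */

/* Hypothetical worker-private table with a dirty list. */
typedef struct {
    uint64_t delta[MAX_KEYS];     /* requests counted since the last flush */
    uint8_t  is_dirty[MAX_KEYS];  /* 1 if the key is already on the list */
    size_t   dirty[MAX_KEYS];     /* keys touched since the last flush */
    size_t   n_dirty;
} worker_table_t;

/* Request path: a plain increment; the key is queued on first touch. */
static void
worker_count(worker_table_t *t, size_t key)
{
    assert(t != NULL && key < MAX_KEYS);
    if (!t->is_dirty[key]) {
        t->is_dirty[key] = 1;
        t->dirty[t->n_dirty++] = key;
    }
    t->delta[key]++;
}

/* Flush tick: visits only the dirty keys and returns how many it visited. */
static size_t
worker_flush(worker_table_t *t, _Atomic uint64_t *shared)
{
    size_t visited = t->n_dirty;

    for (size_t i = 0; i < t->n_dirty; i++) {
        size_t key = t->dirty[i];
        atomic_fetch_add_explicit(&shared[key], t->delta[key],
                                  memory_order_relaxed);
        t->delta[key] = 0;
        t->is_dirty[key] = 0;
    }
    t->n_dirty = 0;
    return visited;
}
```

The request path stays free of atomics and locks, and an idle interval costs the flush timer nothing beyond checking that the dirty list is empty.
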
**NFR-3 Memory bounds**

- **NFR-3.1** The shared-memory zone MUST be sized by the operator at module-load time (FR-5.1) and MUST NOT grow beyond that size. When
  the zone is full, the module MUST drop new keys, increment a dedicated `nginx_ipng_zone_full_events_total` counter, and log at `warn`
  level no more than once per minute per worker.
- **NFR-3.2** The per-worker private counter table MUST be bounded by the same total key count the shared zone admits. A worker MUST NOT
  accumulate private state that exceeds the shared-zone capacity.
- **NFR-3.3** Status codes are collapsed to six classes (`1xx`..`5xx`/`unknown`) at counter-update time (FR-2.6), bounding per-`(source,
  vip)` counter cardinality at six lanes regardless of how many three-digit codes are observed. Any code outside `[100, 599]` falls
  into `code="unknown"`. Per-code resolution is available via `ipng_stats_logtail` (FR-8), which operates off the hot path.

**NFR-4 Reload neutrality**

- **NFR-4.1** `nginx -s reload` spawns a new set of workers while the old workers drain. The shared-memory zone MUST survive this
  transition; counters MUST NOT reset on reload.
- **NFR-4.2** New workers MUST attach to the existing shared-memory zone under the same name, reconstruct their private counter tables
  lazily from observed traffic, and resume flushing.
- **NFR-4.3** The `source` tag for any given listening socket is recomputed at reload time from the new configuration. If the operator
  renames a tag, new traffic MUST use the new tag.
- **NFR-4.4** When a `source` tag is no longer present in any listening socket after a configuration reload, its counters MUST be
  evicted from the shared-memory zone on the first flush tick following the reload. The module MUST NOT retain historical counters under
  defunct tags indefinitely. Rename is expected to be rare and evicting the old entries immediately is acceptable.

**NFR-5 Packaging robustness**

- **NFR-5.1** The module MUST compile cleanly against the nginx-dev headers of the currently supported Debian stable and testing
  releases. CI MUST build one `.deb` per supported release and MUST fail if any target breaks.
- **NFR-5.2** The module MUST NOT depend on any shared library beyond `libc` and nginx's own runtime. No `libnetfilter_*`, no `libcurl`,
  no `libjson*`.
- **NFR-5.3** A version mismatch between the `.so` and the installed nginx binary MUST be detected by nginx at load time (this is the
  purpose of `--with-compat`). The package postinst MUST NOT attempt to work around a mismatch; it reports the failure and leaves the
  operator to upgrade the nginx package.

**NFR-6 Security**

- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` /
  `IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's
  `getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged.
- **NFR-6.2** Access to the scrape endpoint MUST be restricted with an `allow`/`deny` ACL placed in the operator's nginx config. The module
  MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
  controls.
- **NFR-6.3** The module MUST NOT log client IPs, request paths, `User-Agent`, or any other per-request personally-identifying field. It
  logs only aggregate counters and its own operational events.

**NFR-7 Documentation**

- **NFR-7.1** The repository MUST ship a `docs/user-guide.md` that walks an operator through installing the Debian package, loading the
  module, configuring a minimal end-to-end deployment (GRE tunnels, VIPs, `listen` lines, scrape endpoint), verifying that counters are
  flowing, and integrating the scrape endpoint with Prometheus and other consumers. The user guide is the
  document an operator reads once to get from a freshly-installed package to a working, observable deployment.
- **NFR-7.2** The repository MUST ship a `docs/config-guide.md` that enumerates every directive and `listen` parameter introduced by the
  module, together with the nginx configuration contexts (`http`, `server`, `location`, or `listen`) in which each is legal, the allowed
  values, the default, and a one-sentence summary of behavior. The config guide is the document an operator greps when they need to know
  where a given knob is allowed to appear.

## Architecture Overview

### Process Model

The project ships one dynamic nginx module:

- **`ngx_http_ipng_stats_module.so`** — the dynamic module, loaded by nginx's master at startup via `load_module`. It runs entirely inside
  the nginx process model: code executes in nginx workers during the request lifecycle and during per-worker timers. No separate process
  is launched.

There is no daemon, no socket the module listens on, no control plane. Everything the module does is done inline with nginx.

### Data Flow

Requests enter nginx through one of two listener classes:

1. **Device-bound listeners** (`listen ... device=X ipng_source_tag=Y`) accept only connections whose ingress interface is `X`. Each is tagged
   with a source string `Y`.
2. **Wildcard fallback listeners** (`listen 80;`, `listen [::]:80;`) accept everything that didn't match a more specific listener. They
   are tagged with the configured default source (FR-1.3).

During request processing nginx behaves exactly as it would without the module: no handler runs early, no header is rewritten. At log
phase, the module's log-phase handler runs two independent responsibilities:

1. **Counter update** — increments the worker-local counter table keyed by `(source, vip, status_class)`.
2. **Logtail write** — if `ipng_stats_logtail` is configured (FR-8), renders the named `log_format` for this request and appends the
   resulting line to the per-worker write buffer. The buffer is flushed as a UDP datagram on a timer, when full, or on worker exit
   (FR-8.3, FR-8.4). This runs for every request regardless of `server` or `location` context.

A per-worker timer, firing at the configured flush interval (FR-5.2), walks the dirty keys in the worker-local table and applies their
deltas to the shared-memory zone via atomic adds. The same timer triggers a logtail buffer flush if the flush duration has elapsed (FR-8.3).

The scrape handler, when invoked at `GET /.well-known/ipng/statsz` (or whatever location the operator chose), reads the shared-memory zone directly
and formats the output per the requested content type.

Scrape consumers fetch the endpoint at their configured cadence, optionally filtering via `?source_tag=<tag>` so that each consumer only
sees the traffic it delivered.

Aside from the logtail output (FR-8) — which sends UDP datagrams to a configured receiver — no component in this
project writes to anything outside nginx's own memory. The module does not emit log lines on the request path for the counter subsystem
and does not speak to any upstream.

## Components

### The nginx module

`ngx_http_ipng_stats_module` is the entire technical surface of this project. It is a single C module conforming to nginx's
dynamic-module ABI.

#### Responsibilities
|
||
|
||
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that
|
||
also caches the resolved ifindex (FR-1.1, FR-1.2).
|
||
- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection
|
||
carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1).
|
||
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
|
||
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
|
||
- Update EWMAs at flush time (FR-2.4).
|
||
- Serve the scrape endpoint with content negotiation and optional filters (FR-3).
|
||
- Honor `ipng_stats on|off` at any config level (FR-5.5).
|
||
|
||
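
The per-`(device, family)` mapping from the first bullet can be pictured as a small table searched by the ifindex cached at configuration time. This is a sketch only: the struct layout and the names `ipng_source_map_t` and `ipng_source_lookup` are assumptions, not the module's actual definitions.

```c
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Hypothetical entry for one attributed listener; all names are illustrative. */
typedef struct {
    int          family;      /* AF_INET or AF_INET6 */
    unsigned     ifindex;     /* resolved once from device=, e.g. via if_nametoindex() */
    const char  *source_tag;  /* taken from ipng_source_tag= */
} ipng_source_map_t;

/* A linear scan is adequate because the table holds one entry per
 * (device, family) pair. Interfaces that are not listed fall back to the
 * default tag. */
static const char *
ipng_source_lookup(const ipng_source_map_t *map, size_t n, int family,
                   unsigned ifindex, const char *default_tag)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].family == family && map[i].ifindex == ifindex) {
            return map[i].source_tag;
        }
    }
    return default_tag;
}
```
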
#### Attribution Model

The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace
inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host;
the operator writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The module enables
`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted
connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex
up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means
outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives
via a GRE tunnel but the SYN-ACK must leave via the default route.

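
The readback described above can be sketched with ordinary Linux socket calls. The function names are hypothetical, only the IPv4 path is shown (the IPv6 path would use `IPPROTO_IPV6` and `struct in6_pktinfo`), and the listening socket is assumed to have had `IP_PKTINFO` enabled before the connection was accepted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* for struct in_pktinfo on glibc */
#endif
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Scan a control-message buffer, as filled in by getsockopt(IP_PKTOPTIONS),
 * for an IP_PKTINFO cmsg and return its ifindex. 0 means "not found". */
static unsigned
ipng_ifindex_from_cmsgs(void *buf, size_t len)
{
    struct msghdr    msg;
    struct cmsghdr  *cm;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo  pi;

            memcpy(&pi, CMSG_DATA(cm), sizeof(pi));
            return (unsigned) pi.ipi_ifindex;
        }
    }
    return 0;
}

/* Read the stashed options from an accepted IPv4 TCP socket. */
static unsigned
ipng_ingress_ifindex(int fd)
{
    union {
        unsigned char   bytes[256];
        struct cmsghdr  align;      /* forces cmsg-compatible alignment */
    } ctl;
    socklen_t  len = sizeof(ctl.bytes);

    if (getsockopt(fd, IPPROTO_IP, IP_PKTOPTIONS, ctl.bytes, &len) != 0) {
        return 0;
    }
    return ipng_ifindex_from_cmsgs(ctl.bytes, (size_t) len);
}
```

Because the result does not change for the lifetime of a connection, one call per accepted connection is enough; keep-alive requests can reuse the cached value.
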
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.

Because the `device=` binding uses a wildcard address, the module does not need to know the set of VIPs served through each interface.
Adding a VIP (binding an address to `lo` and writing a new `server_name` block) does not require touching the `listen` lines. Adding a
new attributed interface does. This is the correct split: VIPs are vhost-level concerns and change often; interfaces are
infrastructure-level concerns and change rarely.

The design assumes interfaces used as `device=` sources carry **only** traffic from the expected source. Any other traffic arriving on
such an interface is silently misattributed to that interface's source tag. This is a deployment invariant, not a defect.

#### Counter Data Model

Counters are stored as a flat hash table in a shared-memory zone. The key is the tuple `(source_id, vip_id, status_class)` where
`source_id` and `vip_id` are small integers assigned at first observation and `status_class` is one of six values (`0=unknown`,
`1..5` for `1xx`..`5xx`). The value is a fixed-size record containing:

- `requests` (u64)
- `bytes_in` (u64)
- `bytes_out` (u64)
- `duration_sum_ms` (u64) — exported as `nginx_ipng_latency_total` (per class)
- `upstream_sum_ms` (u64)
- `duration_hist` — `B+1` u64 lanes (one per bucket plus the `+Inf` bucket)
- `upstream_hist` — same shape, only updated when an upstream served the request
- `bytes_in_hist`, `bytes_out_hist` — `Bb+1` u64 lanes over the byte-size bucket bounds

Histogram lanes are kept per `(source, vip, class)` in storage, then summed across classes at scrape time to produce one
`_bucket`/`_sum`/`_count` series per `(source, vip)` — the Prometheus exposition never carries a `code` label on histogram series
(FR-3.7).

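
The class-summing step can be illustrated with a small sketch. It assumes that storage lanes hold non-cumulative counts and that accumulation into Prometheus-style cumulative `le` buckets happens at scrape time; the document does not state which form is stored. The bucket bounds and all names below are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define IPNG_NBUCKETS 4   /* B: illustrative; real bounds come from ipng_stats_buckets */
#define IPNG_NCLASSES 6   /* unknown, 1xx..5xx */

/* Illustrative upper bounds in milliseconds; lane B is the implicit +Inf. */
static const uint64_t ipng_bounds_ms[IPNG_NBUCKETS] = { 10, 50, 250, 1000 };

typedef struct {
    uint64_t lanes[IPNG_NBUCKETS + 1];
} ipng_hist_t;

/* Count one observation in the first lane whose bound is >= the value. */
static void
ipng_hist_add(ipng_hist_t *h, uint64_t value_ms)
{
    size_t i = 0;

    while (i < IPNG_NBUCKETS && value_ms > ipng_bounds_ms[i]) {
        i++;
    }
    h->lanes[i]++;
}

/* Scrape-time view for one (source, vip): sum each lane across the six
 * classes and keep a running total so every output lane is cumulative. */
static void
ipng_hist_export(const ipng_hist_t per_class[IPNG_NCLASSES],
                 uint64_t out[IPNG_NBUCKETS + 1])
{
    uint64_t running = 0;

    for (size_t b = 0; b <= IPNG_NBUCKETS; b++) {
        for (size_t c = 0; c < IPNG_NCLASSES; c++) {
            running += per_class[c].lanes[b];
        }
        out[b] = running;
    }
}
```

The last output lane equals the `_count` value, which is why no separate per-histogram counter is needed in this sketch.
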
A parallel table keyed by `(source_id, vip_id)` — one row per VIP — holds the EWMAs for instantaneous rate. EWMAs are floats but updated
only from the flush tick, so there is no float contention on the request path.

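
A flush-tick EWMA update might look like the following. The smoothing weight `alpha`, the seeding rule, and the names are assumptions made for illustration; the document does not specify the module's actual smoothing constants.

```c
#include <stdint.h>

/* Hypothetical EWMA state for one (source, vip) row. */
typedef struct {
    double rate;      /* smoothed requests per second */
    int    primed;    /* 0 until the first sample has been folded in */
} ipng_ewma_t;

/* Fold one flush interval into the average. `alpha` in (0, 1] is the weight
 * given to the newest sample; it is an assumed tuning parameter. */
static void
ipng_ewma_update(ipng_ewma_t *e, uint64_t delta_requests, double interval_s,
                 double alpha)
{
    double sample = (double) delta_requests / interval_s;

    if (!e->primed) {
        e->rate = sample;       /* seed with the first observation */
        e->primed = 1;
        return;
    }
    e->rate = alpha * sample + (1.0 - alpha) * e->rate;
}
```

Since only the flush timer calls this, the floating-point state has a single writer per tick and the request path never touches it.
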
A second, smaller rbtree lives alongside the counter tree — one node per `(source_id, vip_id)` pair — holding three atomic gauge
lanes (`active`, `reading`, `writing`; FR-2.7). Unlike the counter path, gauges are updated from request lifecycle hooks
(`POST_READ`, header filter, pool cleanup) with atomic inc/dec directly on the shared node. The slab mutex is taken only the first
time a `(source, vip)` pair is seen; subsequent transitions on that pair are lock-free. Gauge nodes are never evicted — their
cardinality equals the number of distinct `(source, vip)` pairs and is small in practice.

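
The lock-free gauge transitions can be sketched with C11 atomics standing in for nginx's own atomic primitives. The mapping of hooks to transitions and the per-request phase variable are assumptions; the names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical gauge node for one (source, vip) pair in shared memory. */
typedef struct {
    _Atomic int64_t active;
    _Atomic int64_t reading;
    _Atomic int64_t writing;
} ipng_gauges_t;

/* Per-request phase, kept in request-local state so cleanup knows what to undo. */
typedef enum { IPNG_REQ_READING, IPNG_REQ_WRITING } ipng_phase_t;

/* POST_READ: the request becomes active and starts out reading. */
static void
ipng_gauge_enter(ipng_gauges_t *g, ipng_phase_t *phase)
{
    atomic_fetch_add_explicit(&g->active, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&g->reading, 1, memory_order_relaxed);
    *phase = IPNG_REQ_READING;
}

/* Header filter: move from reading to writing, at most once per request. */
static void
ipng_gauge_to_writing(ipng_gauges_t *g, ipng_phase_t *phase)
{
    if (*phase == IPNG_REQ_READING) {
        atomic_fetch_sub_explicit(&g->reading, 1, memory_order_relaxed);
        atomic_fetch_add_explicit(&g->writing, 1, memory_order_relaxed);
        *phase = IPNG_REQ_WRITING;
    }
}

/* Pool cleanup: undo whichever phase the request ended in. */
static void
ipng_gauge_leave(ipng_gauges_t *g, const ipng_phase_t *phase)
{
    if (*phase == IPNG_REQ_READING) {
        atomic_fetch_sub_explicit(&g->reading, 1, memory_order_relaxed);
    } else {
        atomic_fetch_sub_explicit(&g->writing, 1, memory_order_relaxed);
    }
    atomic_fetch_sub_explicit(&g->active, 1, memory_order_relaxed);
}
```

Tracking the phase per request keeps the gauges balanced even when a request is torn down before any response header is produced.
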
The module also keeps a small string interning table for source and VIP strings, keyed by the integer IDs above, so that the scrape
endpoint can recover the original strings without re-parsing configuration.

String interning is capacity-bounded: the zone is sized by the operator, and once capacity is exhausted new keys are dropped with a
counter bump and an infrequent log line (NFR-3.1). In practice, the number of distinct VIPs on a single nginx host is small (tens, maybe
low hundreds), and the number of distinct source tags is the number of attributed interfaces (single digits). Because status codes are
collapsed to six classes (FR-2.6), the `status_class` dimension contributes at most 6× the `(source, vip)` count — a ~10× reduction
from the per-three-digit-code model considered and discarded.

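
Capacity-bounded interning can be illustrated as follows. The fixed array, the capacity of four, and the names are assumptions for the sake of a short example; in the module the bound would derive from the zone size, and the log line is omitted here.

```c
#include <stdint.h>
#include <string.h>

#define IPNG_INTERN_CAP     4    /* illustrative; the real bound comes from the zone size */
#define IPNG_INTERN_MAXLEN  64

typedef struct {
    char      names[IPNG_INTERN_CAP][IPNG_INTERN_MAXLEN];
    uint32_t  used;
    uint64_t  full_events;   /* would feed nginx_ipng_zone_full_events_total */
} ipng_intern_t;

/* Return the id for `name`, assigning the next free id on first observation.
 * Once the table is full (or the name is too long to store), the key is
 * dropped: the drop counter is bumped and -1 is returned. */
static int
ipng_intern(ipng_intern_t *t, const char *name)
{
    for (uint32_t i = 0; i < t->used; i++) {
        if (strcmp(t->names[i], name) == 0) {
            return (int) i;
        }
    }
    if (t->used == IPNG_INTERN_CAP || strlen(name) >= IPNG_INTERN_MAXLEN) {
        t->full_events++;
        return -1;
    }
    strcpy(t->names[t->used], name);
    return (int) t->used++;
}
```

Existing ids keep resolving after the table fills, which mirrors the behaviour described above: only new keys are affected by exhaustion.
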
#### Hot Path

The log-phase handler is deliberately short. Pseudocode:

```c
static ngx_int_t
ipng_stats_log_handler(ngx_http_request_t *r)
{
    ipng_listen_ctx_t  *lctx;
    ipng_counter_t     *counter;
    ngx_time_t         *tp;
    ngx_msec_int_t      elapsed_ms;
    ngx_uint_t          class_idx;

    if (!ipng_stats_enabled(r)) {
        return NGX_OK;
    }

    lctx = ngx_http_ipng_stats_listen_ctx(r->connection->listening);
    /* lctx contains source_id and the cached VIP id,
       or resolves VIP lazily on first seen address */

    class_idx = ipng_status_to_class(r->headers_out.status);  /* 0..5 */
    counter = ipng_worker_slot(lctx, r->connection->local_sockaddr, class_idx);

    counter->requests++;
    counter->bytes_in  += r->request_length;
    counter->bytes_out += r->connection->sent;

    /* r->start_msec holds only the millisecond part of the start time, so it
       is combined with r->start_sec, as nginx does for $request_time */
    tp = ngx_timeofday();
    elapsed_ms = (ngx_msec_int_t)
                 ((tp->sec - r->start_sec) * 1000 + (tp->msec - r->start_msec));
    elapsed_ms = ngx_max(elapsed_ms, 0);
    ipng_hist_add(&counter->duration_hist, elapsed_ms);
    counter->duration_sum_ms += elapsed_ms;

    /* upstream lanes are touched only when an upstream served the request */
    if (r->upstream_states && r->upstream_states->nelts > 0) {
        ngx_msec_int_t up_ms = ipng_upstream_total_ms(r);
        ipng_hist_add(&counter->upstream_hist, up_ms);
        counter->upstream_sum_ms += up_ms;
    }

    return NGX_OK;
}
```

Nothing here touches shared memory. `ipng_worker_slot` resolves a private table slot using a small per-worker hash keyed by
`(source_id, vip_id, class_idx)`. VIP lookup is cached on the connection so that keep-alive requests reuse the resolved ID.

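
The two helpers the handler leans on can be sketched in standalone C. This is an illustration under assumptions: the table size, the hash mix, and the name `ipng_slot_find` are invented here, and the signature differs from the `ipng_worker_slot` call in the pseudocode, which takes the listen context and local address instead of resolved ids.

```c
#include <stddef.h>
#include <stdint.h>

#define IPNG_SLOTS 64   /* power of two; illustrative size */

typedef struct {
    uint32_t  source_id;
    uint32_t  vip_id;
    uint8_t   class_idx;
    uint8_t   in_use;
    uint64_t  requests;
} ipng_slot_t;

/* Map an HTTP status to 0 (unknown) or 1..5 for 1xx..5xx. */
static unsigned
ipng_status_to_class(unsigned status)
{
    return (status >= 100 && status <= 599) ? status / 100 : 0;
}

/* Open-addressing lookup with linear probing in a worker-private table.
 * Claims an empty slot on first use; returns NULL when the table is full. */
static ipng_slot_t *
ipng_slot_find(ipng_slot_t *table, uint32_t source_id, uint32_t vip_id,
               unsigned class_idx)
{
    uint32_t h = (source_id * 2654435761u) ^ (vip_id * 40503u) ^ class_idx;

    for (size_t probe = 0; probe < IPNG_SLOTS; probe++) {
        ipng_slot_t *s = &table[(h + probe) & (IPNG_SLOTS - 1)];

        if (!s->in_use) {
            s->source_id = source_id;
            s->vip_id = vip_id;
            s->class_idx = (uint8_t) class_idx;
            s->in_use = 1;
            return s;
        }
        if (s->source_id == source_id && s->vip_id == vip_id
            && s->class_idx == class_idx)
        {
            return s;
        }
    }
    return NULL;
}
```

Because the table belongs to a single worker, no locking or atomics are needed on this path.
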
#### Flush Timer

At the interval configured by `ipng_stats_flush_interval` (default 1s), the worker:

1. Iterates its dirty-slot list (slots touched since the previous flush).
2. For each dirty slot, computes the delta versus the last flushed snapshot stored in the same slot.
3. Applies the delta to the shared-zone slot using 64-bit relaxed `fetch_add` on each counter lane.
4. Updates EWMAs from the delta.
5. Clears the dirty list (not the slot itself; slot state is preserved so the next flush can compute deltas again).

The worker never walks the entire table — only dirty slots — so idle VIPs cost nothing.

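
Steps 1, 2, 3, and 5 above can be sketched with C11 atomics. Only two counter lanes are shown, the EWMA step is left out, and the struct layouts and names are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared-zone record; only two lanes shown. */
typedef struct {
    _Atomic uint64_t requests;
    _Atomic uint64_t bytes_out;
} ipng_shared_t;

/* Hypothetical worker-local slot: running totals plus the last flushed snapshot. */
typedef struct {
    uint64_t        requests, bytes_out;
    uint64_t        flushed_requests, flushed_bytes_out;
    ipng_shared_t  *shared;
    int             dirty;
} ipng_local_t;

/* Hot path: bump private totals and queue the slot once per flush window.
 * The caller sizes `dirty` so it can hold every slot. */
static void
ipng_touch(ipng_local_t *s, uint64_t bytes, ipng_local_t **dirty, size_t *ndirty)
{
    s->requests++;
    s->bytes_out += bytes;
    if (!s->dirty) {
        s->dirty = 1;
        dirty[(*ndirty)++] = s;
    }
}

/* Timer tick: push deltas for dirty slots only, then reset the dirty list. */
static void
ipng_flush(ipng_local_t **dirty, size_t *ndirty)
{
    for (size_t i = 0; i < *ndirty; i++) {
        ipng_local_t *s = dirty[i];

        atomic_fetch_add_explicit(&s->shared->requests,
                                  s->requests - s->flushed_requests,
                                  memory_order_relaxed);
        atomic_fetch_add_explicit(&s->shared->bytes_out,
                                  s->bytes_out - s->flushed_bytes_out,
                                  memory_order_relaxed);

        s->flushed_requests = s->requests;      /* keep totals, move the snapshot */
        s->flushed_bytes_out = s->bytes_out;
        s->dirty = 0;
    }
    *ndirty = 0;
}
```

Relaxed ordering suffices in this sketch because each lane is an independent monotonic counter and no other memory is published through it.
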
#### Scrape Handler

The `ipng_stats` handler is a leaf content handler. It:

1. Parses `?source_tag=` and `?vip=` into exact-match filters.
2. Parses `Accept:` to pick output format.
3. Walks the shared-memory zone under a shared lock (readers hold the read side of a rwlock; flushes and interners hold the write side
   briefly).
4. Emits each matching key in the chosen format directly into an nginx chain buffer.

Output buffering and sending are standard nginx content handler code. The handler does not allocate during the walk; it uses a
fixed-size buffer per chain link and requests new links only when full.

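
Format selection and emission into a fixed-size buffer might be sketched as below. Several details are assumptions rather than documented behaviour: the metric name `nginx_ipng_requests_total`, the JSON object shape, the substring test on `Accept`, and Prometheus text as the default. Only the label names `source_tag`, `vip`, and `code` come from this document.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { IPNG_FMT_PROM, IPNG_FMT_JSON } ipng_fmt_t;

/* Pick JSON only when the Accept header names it; otherwise Prometheus text. */
static ipng_fmt_t
ipng_pick_format(const char *accept)
{
    return (accept != NULL && strstr(accept, "application/json") != NULL)
           ? IPNG_FMT_JSON : IPNG_FMT_PROM;
}

/* Write one request-counter sample into `buf`. Returns the number of bytes
 * written, or 0 when the sample does not fit, in which case the caller would
 * start a new chain link and retry. */
static size_t
ipng_emit_requests(char *buf, size_t cap, ipng_fmt_t fmt, const char *source_tag,
                   const char *vip, const char *code, uint64_t value)
{
    int n;

    if (fmt == IPNG_FMT_JSON) {
        n = snprintf(buf, cap,
                     "{\"source_tag\":\"%s\",\"vip\":\"%s\",\"code\":\"%s\","
                     "\"requests\":%llu}\n",
                     source_tag, vip, code, (unsigned long long) value);
    } else {
        n = snprintf(buf, cap,
                     "nginx_ipng_requests_total{source_tag=\"%s\",vip=\"%s\","
                     "code=\"%s\"} %llu\n",
                     source_tag, vip, code, (unsigned long long) value);
    }
    return (n < 0 || (size_t) n >= cap) ? 0 : (size_t) n;
}
```

Returning 0 on overflow, instead of truncating, keeps every emitted line complete, which matters for line-oriented consumers of the Prometheus text format.
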
#### Presents and Consumes

**Presents.**

- **One nginx content handler**, `ipng_stats`, usable in any `location` block. Serves Prometheus text and JSON, filtered by optional
  query parameters.
- **Two new `listen` parameters**, `device=` and `ipng_source_tag=`, usable anywhere a `listen` directive is used.
- **Six new `http`-level directives**: `ipng_stats_zone`, `ipng_stats_flush_interval`, `ipng_stats_default_source`,
  `ipng_stats_buckets`, `ipng_stats_byte_buckets`, `ipng_stats` (on/off).
- **A Prometheus metric family** prefixed `nginx_ipng_*`, labelled `source_tag`, `vip`, and (for counter metrics) a `code` class label
  (`1xx`..`5xx`/`unknown`).

**Consumes.**

- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection
  `getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd.
- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.

### The Debian package

`libnginx-mod-http-ipng-stats` is the packaging wrapper. There is no ambition to build RPMs, Alpine packages, or a Homebrew formula;
Debian is the target and upstream nginx on Debian is the platform.

#### Responsibilities

- Build the module against the target release's nginx-dev headers with `--with-compat` (NFR-5.1, NFR-5.3).
- Install the compiled `.so` into `/usr/lib/nginx/modules` (FR-7.3).
- Drop a `load_module` stanza into `/etc/nginx/modules-available/` and enable it by default via a symlink in `modules-enabled/`
  (FR-7.3).
- Sanity-check the resulting config with `nginx -t` in the postinst and back out cleanly if it fails (FR-7.4).

#### Build

The build is a plain `debian/rules` invocation that:

1. Fetches the nginx source for the installed `nginx-dev` version.
2. Runs `./configure --with-compat --add-dynamic-module=...` pointed at the module tree.
3. Builds only the module (`make modules`).
4. Installs the resulting `.so` into the package tree.

No nginx binary is produced, shipped, or touched. The package is strictly additive.

#### Presents and Consumes

**Presents.**

- **One Debian package** per supported release.
- **One dynamic module** loadable into stock upstream nginx.

**Consumes.**

- **The target release's `nginx-dev` package** at build time.
- **The running `nginx` package** at install time, for `nginx -t` validation.

## Operational Concerns

### Deployment Topology

A typical deployment on a single nginx host looks like:

- One interface per traffic source that should be separately attributed (e.g. GRE tunnels, VLANs), set up by the operator's networking
  layer (systemd-networkd, Netplan, or a hand-rolled interface config). Interface names follow a consistent pattern, typically
  `gre-<tag>` — e.g. `gre-mg1`, `gre-mg2`.
- VIPs bound to a local dummy or loopback interface so the kernel accepts packets destined for them.
- A hand-maintained `listen` include file with one device-bound listen per `(family, interface)` pair, reused across vhosts.
- Fallback `listen 80;` and `listen [::]:80;` in whichever server blocks serve direct web traffic.
- A single scrape location, e.g. `location = /.well-known/ipng/statsz`, served from a locked-down server block that only allows scrape consumers and
  the local Prometheus scraper.

### Configuration

A minimal working configuration is about twenty lines:

```nginx
load_module modules/ngx_http_ipng_stats_module.so;

http {
    ipng_stats_zone ipng:4m;

    server {
        listen 80;
        listen [::]:80;
        include /etc/nginx/ipng-stats/listens.conf;

        server_name _;
        # ... normal vhost content
    }

    server {
        listen 127.0.0.1:9113;
        location = /.well-known/ipng/statsz {
            ipng_stats;
            allow 127.0.0.1;
            allow 2001:db8::/48;  # scrape consumers
            deny all;
        }
    }
}
```

`listens.conf` holds two lines per attributed interface (one per address family) and is stable across vhost changes.

### Nginx Reload Semantics

`nginx -s reload` forks fresh workers, has old workers finish in-flight requests, and then shuts the old workers down. The plugin's
shared-memory zone is declared by name, which survives the reload; new workers attach to the same zone and continue accumulating
counters against the same keys. Counters MUST NOT reset on reload (NFR-4.1).

Source tags are recomputed from the new configuration on reload (NFR-4.3). Renaming a tag in configuration means new traffic appears
under the new name; the old name lingers in the zone until either operator restart or an LRU eviction policy ages it out (this is one
of the open questions below).

### Observability of the Plugin Itself

The plugin emits a handful of meta-metrics on the same scrape endpoint:

- `nginx_ipng_zone_bytes_used` / `nginx_ipng_zone_bytes_total` — zone high-water and capacity.
- `nginx_ipng_zone_full_events_total` — number of key insertions that were dropped because the zone was full.
- `nginx_ipng_flushes_total` — number of per-worker flush ticks that have run.
- `nginx_ipng_flush_duration_seconds` — histogram of flush durations.
- `nginx_ipng_scrape_duration_seconds` — histogram of scrape handler durations.

These make it possible to alert on "the module is running hot" and "the zone is full" without having to run a second scraper against
some other endpoint.

### Failure Modes

- **Shared zone full.** New keys are dropped, a counter is incremented, a rate-limited warning is logged, and the operator is expected
  to resize the zone. Existing keys continue updating normally (NFR-3.1).
- **Worker crash.** The crashed worker's private counter deltas since its last flush are lost. The shared zone is unaffected. Since the
  default flush interval is one second, the worst-case data loss is one second of that worker's traffic. This is acceptable for an
  observability plane.
- **nginx master crash / package upgrade.** The shared zone is torn down with the old master. When the new master starts, the zone is
  recreated empty. Counters start from zero. Consumers that need history SHOULD read from Prometheus, which retains history across
  restarts.
- **Device disappears.** If an operator removes an interface without removing its `listen` line, nginx's bind will fail on the next
  reload and the reload will error cleanly. The module does not hide this; a failing `nginx -t` is the right answer.
- **Traffic on a wildcard listener that should have been device-bound.** The traffic is counted under `direct` (or the configured
  default). This is detectable: if the operator expects zero traffic under `direct` and the dashboard shows non-zero, an interface is
  probably missing from the `listen` include.
- **Slow scrape on a large zone.** Scrape cost is linear in the number of keys (NFR-2.3). On a host with a very large VIP count, the
  operator SHOULD increase the flush interval, lower the scrape frequency, or both. The module does not cap scrape runtime.
- **Scrape consumer is down.** The module is unaffected; its counters continue to increment and the Prometheus scrape continues to work.
  When the consumer comes back, it resumes fetching. No state is lost.

### Security

- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the
  listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1).
- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
  this is deliberate, and documented (NFR-6.2).
- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
  Access-log-style observation belongs in nginx's own access log (NFR-6.3).
- **Zone sizing as a soft DoS mitigation.** Because new keys are dropped when the zone is full rather than allocating unbounded memory,
  a stream of bogus traffic cannot cause the module to exhaust nginx's memory. The tradeoff is that a real new VIP added after zone
  exhaustion won't be tracked until the operator resizes — explicit and visible in the meta-metrics.

## Alternatives Considered

- **OpenResty + `lua-nginx-module` + `nginx-lua-prometheus`.** Rejected. Adds a large runtime dependency just for a narrow feature. The
  deployment target is stock upstream nginx on Debian, and shipping an entirely different nginx build would defeat half the point of
  packaging.
- **Access log tailing sidecar.** Rejected. Decoupled but introduces a second deploy unit, a log-rotation race, and a synchronization
  gap between access log truncation and counter accuracy. Also loses live EWMAs.
- **`nginx-module-vts`.** Considered. VTS is a perfectly good general-purpose metric module, but it has no concept of "which ingress
  interface did this request come in on", which is the entire innovation here. Adapting VTS to attribute by ingress interface would be a
  bigger diff than writing a purpose-built module.
- **Attribution via CONNMARK on a single shared GRE tunnel.** Rejected after investigation. Netfilter loses the outer GRE source during
  decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
  accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
  heavy for a log-phase handler.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still
  requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves
  the same problem with plain socket options nginx already knows about.
- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production:
  `SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening
  socket before `accept()` returns — which there is no hook to override in userspace. On DSR / maglev setups (SYN via a GRE
  tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink,
  so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost.
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
  socket-lookup specificity.
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works
  because nginx routes by `server_name` post-accept, so the `listen` only needs to express `(port, device)` and does not need the VIP
  address. This makes the generated include file size independent of the VIP count.
- **Pushing counters to an external daemon over gRPC.** Rejected. It complicates restart neutrality and adds a gRPC client dependency to
  a C module. Pull-based scrape is simpler: consumers fetch when they want, and the module has no outbound connections.
- **Shipping separate JSON and Prometheus handlers.** Rejected. Content negotiation on one handler is simpler to configure and serves
  both audiences from one ACL.

## Decisions Deferred Post-v0.1

- **Histogram bucket overrides per `source` or per `vip`.** v0.1 keeps FR-2.3's module-level set. If a single nginx instance ever serves
  both latency-sensitive (API) and bulk (download) traffic on the same host such that one bucket set is too compromised, making buckets
  per-`source` or per-`vip` is possible but multiplies memory and complicates Prometheus output.
- **TLS handshake metrics.** The module reports `request_duration` from the start of the HTTP request, not from TCP accept. For
  TLS-terminating frontends a handshake-time fraction is invisible. Adding a `tls_handshake_duration` histogram is deferred until
  operators ask for it.
- **Consumer fetch cadence.** Whichever cadence a consumer adopts for traffic counters — a one-second refresh, a longer Prometheus
  scrape interval, or an SSE bridge layered on top — the plugin supports it. The choice is on the consumer side.