From 055cf9f830389a42e3b162388ed28f9ab1b0d369 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sat, 18 Apr 2026 15:16:10 +0200
Subject: [PATCH] Update docs and header comments for the IP_PKTINFO attribution model
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The SO_BINDTODEVICE → IP_PKTINFO switch in the previous commit was a
semantic change: the module no longer touches outgoing routing at all,
and several places in the docs and the module's top-of-file comment
still described the old mechanism.

- README.md and debian/control now describe attribution as reading the
  ingress ifindex per connection from the kernel's IP_PKTINFO /
  IPV6_PKTINFO cmsg, and explicitly call out that the DSR / maglev
  return-path constraint is what makes the change necessary.
- docs/design.md FR-1.1 / FR-1.5 / FR-1.6 are rewritten to forbid
  SO_BINDTODEVICE and to describe the cmsg-based lookup. NFR-6.1 notes
  these are ordinary unprivileged socket options. The "Components" /
  "Composes With" sections and the "Alternatives Considered" entry are
  brought in line — and a new entry records SO_BINDTODEVICE as a
  rejected alternative with the exact failure mode seen on an IPng
  production box.
- docs/config-guide.md already carried the new description; unchanged
  here.
- src/ngx_http_ipng_stats_module.c's top-level block comment is
  rewritten to match; the section header above init_module goes from
  "rebind listen sockets with SO_BINDTODEVICE" to "enable IP_PKTINFO on
  listen sockets, resolve ifindexes".

Three SO_BINDTODEVICE mentions deliberately remain in the source and
one in the design doc's alternatives table — all of them explain that
the module *avoids* the option, which is itself load-bearing
documentation.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md                        | 11 +++--
 debian/control                   |  9 +++--
 docs/design.md                   | 69 +++++++++++++++++++-------------
 src/ngx_http_ipng_stats_module.c | 17 +++++---
 4 files changed, 65 insertions(+), 41 deletions(-)

diff --git a/README.md b/README.md
index 44cdea8..c937428 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,13 @@
 Per-VIP, per-device traffic counters for nginx. Ships as a dynamic nginx module and a Debian package that loads into stock upstream
 nginx on Debian Trixie.
 
-The module attributes every HTTP request to the interface it arrived on, using Linux `SO_BINDTODEVICE` on per-interface listening
-sockets. Counters — requests, status codes, bytes, latency histograms — are exposed as Prometheus text or JSON from a single HTTP
-scrape endpoint, filtered per-source. This is useful for any deployment where traffic arrives on distinct interfaces — GRE tunnels,
-VLANs, bonded links, or plain ethernet — and per-interface observability is needed.
+The module attributes every HTTP request to the interface it arrived on, reading the ingress `ifindex` per connection from the
+kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg. Listening sockets stay plain wildcards, so outgoing packets follow the normal
+routing table — which is what makes this safe for DSR / maglev deployments where the SYN arrives via a GRE tunnel and the
+SYN-ACK must leave via the default route. Counters — requests, status codes, bytes, latency histograms — are exposed as
+Prometheus text or JSON from a single HTTP scrape endpoint, filtered per-source. This is useful for any deployment where
+traffic arrives on distinct interfaces — GRE tunnels, VLANs, bonded links, or plain ethernet — and per-interface observability
+is needed.
 
 Without any `device=`/`ipng_source_tag=` parameters, the module still counts and exposes per-VIP traffic under the configurable
 default source tag (`direct`), which makes it a useful plain observability module for any nginx host.
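Reviewer note: the README paragraph above describes the attribution path in prose, so a small illustration may help. The following is a minimal sketch in C of how such a readback could look, not the module's actual code. It covers IPv4 only and ignores the per-family dimension of the mapping. All names (`ipng_map_entry`, `ipng_enable_pktinfo`, `ipng_ifindex_from_pktoptions`, `ipng_accepted_ifindex`, `ipng_tag_for_ifindex`) are hypothetical. It relies on the behaviour this patch documents: the accepted socket inherits `IP_PKTINFO` from the listener, and `getsockopt(IP_PKTOPTIONS)` on a Linux TCP socket returns a control-message buffer.

```c
#define _GNU_SOURCE             /* exposes struct in_pktinfo on glibc */
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical mapping entry, one per `device=` listen parameter. */
struct ipng_map_entry {
    int         ifindex;        /* resolved once, e.g. with if_nametoindex() */
    const char *tag;            /* ipng_source_tag= value, or the device name */
};

/* Ask the kernel to record the ingress ifindex for connections
 * accepted from this IPv4 listening socket. Unprivileged. */
int
ipng_enable_pktinfo(int listen_fd)
{
    int on = 1;

    return setsockopt(listen_fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Walk a control-message buffer and return the ifindex carried in the
 * first IP_PKTINFO message, or 0 (never a valid ifindex) if none. */
int
ipng_ifindex_from_pktoptions(void *buf, size_t len)
{
    struct msghdr   msg;
    struct cmsghdr *c;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO
            && c->cmsg_len >= CMSG_LEN(sizeof(struct in_pktinfo)))
        {
            struct in_pktinfo pi;

            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return pi.ipi_ifindex;
        }
    }

    return 0;
}

/* Read the stored control messages back from an accepted TCP socket. */
int
ipng_accepted_ifindex(int conn_fd)
{
    union {
        char           buf[CMSG_SPACE(sizeof(struct in_pktinfo)) + 64];
        struct cmsghdr align;   /* forces cmsghdr alignment of buf */
    } u;
    socklen_t len = sizeof(u.buf);

    if (getsockopt(conn_fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) != 0) {
        return 0;
    }

    return ipng_ifindex_from_pktoptions(u.buf, len);
}

/* Map an ifindex to its source tag; unknown interfaces fall back to
 * the default tag (`direct` unless configured otherwise). */
const char *
ipng_tag_for_ifindex(const struct ipng_map_entry *map, size_t n,
                     int ifindex, const char *default_tag)
{
    assert(default_tag != NULL);

    for (size_t i = 0; i < n; i++) {
        if (map[i].ifindex == ifindex) {
            return map[i].tag;
        }
    }

    return default_tag;
}
```

In this sketch a log-phase caller would do something like `ipng_tag_for_ifindex(map, n, ipng_accepted_ifindex(fd), "direct")`. Nothing here touches the socket's egress path, which is the property the README text calls out for DSR / maglev deployments.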
diff --git a/debian/control b/debian/control index f00412b..4cbc51e 100644 --- a/debian/control +++ b/debian/control @@ -27,11 +27,12 @@ Description: nginx dynamic module for per-VIP, per-device traffic counters request to the interface it arrived on. Counters are exposed as Prometheus text and JSON from a single scrape endpoint. . - Attribution is done by the Linux kernel's TCP socket lookup, using - SO_BINDTODEVICE on per-interface listening sockets. The module adds + Attribution is done by reading the ingress ifindex per connection + from the kernel's IP_PKTINFO / IPV6_PKTINFO cmsg; listening sockets + stay plain wildcards so outgoing packets follow the normal routing + table (which matters for DSR / maglev setups). The module adds device= and ipng_source_tag= parameters to the nginx listen - directive; the kernel routes each incoming connection to the - correct listener by ingress interface. + directive, mapping interface names to source tags. . Typical use cases include GRE tunnel fleets, VLAN trunks, or any deployment where traffic arrives on distinct interfaces and diff --git a/docs/design.md b/docs/design.md index 24598ed..fa31507 100644 --- a/docs/design.md +++ b/docs/design.md @@ -62,8 +62,8 @@ cannot provide. But the module is not coupled to that use case — it works with access logs and existing log-analysis tools. - The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on restart they start at zero. Consumers who need historical retention SHOULD scrape it from Prometheus. -- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and - nginx master privileges are the operator's responsibility. +- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges + are the operator's responsibility. 
## Requirements @@ -73,20 +73,24 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat **FR-1 Attribution** -- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=`, which causes the resulting - listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a - plain listening socket as stock nginx does. +- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=`, which records a mapping + between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the + kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module + MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev + deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without + `device=` MUST create a plain listening socket as stock nginx does. - **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=`, which attaches a short string tag to the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener. - **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see `ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`. - **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`. -- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup - rules MUST be relied on to dispatch each SYN to the most specific match. 
The module MUST NOT attempt to duplicate this logic in - userspace. -- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=` MUST accept only - connections whose ingress interface is ``, for any local address served through that interface. This is the intended deployment - shape: wildcard fallback plus per-tunnel device-bound listeners. +- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist. Since no `SO_BINDTODEVICE` + is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the module distinguishes them by + reading `ifindex` from the per-connection cmsg — so "multiple device-tagged listens at the same port" at the config level + collapses to one kernel socket at runtime without any userspace contortions. +- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=` MUST attribute + every connection whose ingress interface is `` — regardless of which local address the client addressed — to that + listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3). **FR-2 Counters** @@ -268,9 +272,9 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat **NFR-6 Security** -- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made - in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)` - themselves. +- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` / + `IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's + `getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged. 
- **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access controls. @@ -339,8 +343,10 @@ dynamic-module ABI. #### Responsibilities -- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2). -- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1). +- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that + also caches the resolved ifindex (FR-1.1, FR-1.2). +- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection + carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1). - Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1). - Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2). - Update EWMAs at flush time (FR-2.4). @@ -349,11 +355,14 @@ dynamic-module ABI. #### Attribution Model -The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace -inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator -writes one `listen ... device= ipng_source_tag=` line per `(family, interface)` pair. The kernel binds that listening socket -with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and -`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic. 
+The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace +inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; +the operator writes one `listen ... device= ipng_source_tag=` line per `(family, interface)` pair. The module enables +`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted +connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex +up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means +outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives +via a GRE tunnel but the SYN-ACK must leave via the default route. The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to. @@ -482,7 +491,8 @@ fixed-size buffer per chain link and requests new links only when full. **Consumes.** - **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool. -- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind. +- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection + `getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd. - **The nginx log phase and connection structures** — standard module embedding, no private kernel calls. ### The Debian package @@ -614,8 +624,8 @@ some other endpoint. ### Security -- **Capabilities.** The module needs no capabilities beyond what nginx already has. 
`SO_BINDTODEVICE` is called by the master during - bind; workers never call it (NFR-6.1). +- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the + listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1). - **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth; this is deliberate, and documented (NFR-6.2). - **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched. @@ -638,9 +648,14 @@ some other endpoint. decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too heavy for a log-phase handler. -- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source - tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with - kernel primitives nginx already knows about. +- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still + requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves + the same problem with plain socket options nginx already knows about. +- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production: + `SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening + socket before `accept()` returns — which there is no hook to override in userspace. 
On DSR / maglev setups (SYN via a GRE + tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink, + so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost. - **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via socket-lookup specificity. - **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works diff --git a/src/ngx_http_ipng_stats_module.c b/src/ngx_http_ipng_stats_module.c index 9628d11..702f0af 100644 --- a/src/ngx_http_ipng_stats_module.c +++ b/src/ngx_http_ipng_stats_module.c @@ -8,11 +8,16 @@ * See docs/design.md in the repository for the full design. The short * version is: * - * - Attribution is done by the Linux kernel's TCP socket lookup, via - * SO_BINDTODEVICE on per-tunnel listening sockets. Each `listen` - * directive may carry `device=` and `ipng_source_tag=` - * parameter; this module parses them by replacing the stock - * ngx_http_core_module `listen` command handler at preconfig time. + * - Attribution is done by reading the ingress ifindex per TCP + * connection from the kernel's IP_PKTINFO / IPV6_PKTINFO cmsg + * (enabled on every HTTP listening socket at init_module time). + * Listening sockets stay plain wildcards so egress follows the + * normal routing table — no SO_BINDTODEVICE, so DSR / maglev + * setups keep working. Each `listen` directive may carry + * `device=` and `ipng_source_tag=` parameters; this + * module parses them by replacing the stock ngx_http_core_module + * `listen` command handler at preconfig time, and maintains the + * ifindex → source tag lookup table used by the log handler. 
* * - Counters are maintained per-worker in a private table (no locks, * no atomics on the request path) and flushed into a shared-memory @@ -1206,7 +1211,7 @@ ngx_http_ipng_stats_init_zone(ngx_shm_zone_t *shm_zone, void *data) /* ----------------------------------------------------------------- */ -/* init_module: rebind listen sockets with SO_BINDTODEVICE */ +/* init_module: enable IP_PKTINFO on listen sockets, resolve ifindexes */ /* ----------------------------------------------------------------- */ /* init_module: enable IP_PKTINFO / IPV6_RECVPKTINFO on every HTTP