Update docs and header comments for the IP_PKTINFO attribution model

The SO_BINDTODEVICE → IP_PKTINFO switch in the previous commit
was a semantic change: the module no longer touches outgoing
routing at all, and several places in the docs and the module's
top-of-file comment still described the old mechanism.

- README.md and debian/control now describe attribution as
  reading the ingress ifindex per connection from the kernel's
  IP_PKTINFO / IPV6_PKTINFO cmsg, and explicitly call out that
  the DSR / maglev return-path constraint is what makes the
  change necessary.
- docs/design.md FR-1.1 / FR-1.5 / FR-1.6 are rewritten to
  forbid SO_BINDTODEVICE and to describe the cmsg-based lookup.
  NFR-6.1 notes these are ordinary unprivileged socket options.
  The "Components" / "Composes With" sections and the
  "Alternatives Considered" entry are brought in line — and a
  new entry records SO_BINDTODEVICE as a rejected alternative
  with the exact failure mode seen on an IPng production box.
- docs/config-guide.md already carried the new description;
  unchanged here.
- src/ngx_http_ipng_stats_module.c's top-level block comment is
  rewritten to match; the section header above init_module goes
  from "rebind listen sockets with SO_BINDTODEVICE" to "enable
  IP_PKTINFO on listen sockets, resolve ifindexes".

Three SO_BINDTODEVICE mentions deliberately remain in the source
and one in the design doc's alternatives table — all of them
explain that the module *avoids* the option, which is itself
load-bearing documentation.
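
For readers who have not met the cmsg-based lookup before, here is a minimal sketch of the read-back step these docs describe. It is not the module's actual code. It assumes Linux and IPv4 only, the function names `ifindex_from_pktoptions` and `ingress_ifindex_v4` are hypothetical, and the listening socket is assumed to have had `IP_PKTINFO` enabled before the connection was accepted.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Scan a control-message buffer for IP_PKTINFO and return its ifindex,
 * or 0 if none is present. Hypothetical helper, not the module's code. */
unsigned int
ifindex_from_pktoptions(void *buf, size_t len)
{
    struct msghdr   msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO
            && cm->cmsg_len >= CMSG_LEN(sizeof(struct in_pktinfo)))
        {
            struct in_pktinfo pi;

            memcpy(&pi, CMSG_DATA(cm), sizeof(pi));
            return pi.ipi_ifindex > 0 ? (unsigned int) pi.ipi_ifindex : 0;
        }
    }

    return 0;
}

/* Ask the kernel for the options stored on an accepted IPv4 TCP socket.
 * Returns 0 on any failure so the caller can fall back to a default. */
unsigned int
ingress_ifindex_v4(int accepted_fd)
{
    union { char buf[256]; struct cmsghdr align; } u;
    socklen_t len = sizeof(u.buf);

    memset(&u, 0, sizeof(u));
    if (getsockopt(accepted_fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) != 0) {
        return 0;
    }

    return ifindex_from_pktoptions(u.buf, (size_t) len);
}
```

A log handler could call `ingress_ifindex_v4()` on the accepted fd and treat a return of 0 as "unknown interface", which maps naturally onto the default source tag. An IPv6 variant would read `struct in6_pktinfo` through the corresponding IPv6 option and is left out here.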

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:16:10 +02:00
parent 31c2ac2d65
commit 055cf9f830
4 changed files with 65 additions and 41 deletions

README.md

@@ -4,10 +4,13 @@
 Per-VIP, per-device traffic counters for nginx. Ships as a dynamic nginx module and a Debian package that loads into stock upstream
 nginx on Debian Trixie.
-The module attributes every HTTP request to the interface it arrived on, using Linux `SO_BINDTODEVICE` on per-interface listening
-sockets. Counters — requests, status codes, bytes, latency histograms — are exposed as Prometheus text or JSON from a single HTTP
-scrape endpoint, filtered per-source. This is useful for any deployment where traffic arrives on distinct interfaces — GRE tunnels,
-VLANs, bonded links, or plain ethernet — and per-interface observability is needed.
+The module attributes every HTTP request to the interface it arrived on, reading the ingress `ifindex` per connection from the
+kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg. Listening sockets stay plain wildcards, so outgoing packets follow the normal
+routing table — which is what makes this safe for DSR / maglev deployments where the SYN arrives via a GRE tunnel and the
+SYN-ACK must leave via the default route. Counters — requests, status codes, bytes, latency histograms — are exposed as
+Prometheus text or JSON from a single HTTP scrape endpoint, filtered per-source. This is useful for any deployment where
+traffic arrives on distinct interfaces — GRE tunnels, VLANs, bonded links, or plain ethernet — and per-interface observability
+is needed.
 Without any `device=`/`ipng_source_tag=` parameters, the module still counts and exposes per-VIP traffic under the configurable
 default source tag (`direct`), which makes it a useful plain observability module for any nginx host.

debian/control

@@ -27,11 +27,12 @@ Description: nginx dynamic module for per-VIP, per-device traffic counters
 request to the interface it arrived on. Counters are exposed as
 Prometheus text and JSON from a single scrape endpoint.
 .
-Attribution is done by the Linux kernel's TCP socket lookup, using
-SO_BINDTODEVICE on per-interface listening sockets. The module adds
+Attribution is done by reading the ingress ifindex per connection
+from the kernel's IP_PKTINFO / IPV6_PKTINFO cmsg; listening sockets
+stay plain wildcards so outgoing packets follow the normal routing
+table (which matters for DSR / maglev setups). The module adds
 device= and ipng_source_tag= parameters to the nginx listen
-directive; the kernel routes each incoming connection to the
-correct listener by ingress interface.
+directive, mapping interface names to source tags.
 .
 Typical use cases include GRE tunnel fleets, VLAN trunks, or any
 deployment where traffic arrives on distinct interfaces and

docs/design.md

@@ -62,8 +62,8 @@ cannot provide. But the module is not coupled to that use case — it works with
 access logs and existing log-analysis tools.
 - The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
 restart they start at zero. Consumers who need historical retention SHOULD scrape it from Prometheus.
-- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and
-nginx master privileges are the operator's responsibility.
+- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges
+are the operator's responsibility.
 ## Requirements
@@ -73,20 +73,24 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
 **FR-1 Attribution**
-- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
-listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
-plain listening socket as stock nginx does.
+- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which records a mapping
+between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the
+kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module
+MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev
+deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without
+`device=` MUST create a plain listening socket as stock nginx does.
 - **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
 the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
 - **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
 `ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`.
 - **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
-- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
-rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
-userspace.
-- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only
-connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment
-shape: wildcard fallback plus per-tunnel device-bound listeners.
+- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist. Since no `SO_BINDTODEVICE`
+is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the module distinguishes them by
+reading `ifindex` from the per-connection cmsg — so "multiple device-tagged listens at the same port" at the config level
+collapses to one kernel socket at runtime without any userspace contortions.
+- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST attribute
+every connection whose ingress interface is `<ifname>` — regardless of which local address the client addressed — to that
+listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3).
 **FR-2 Counters**
@@ -268,9 +272,9 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
 **NFR-6 Security**
-- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made
-in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)`
-themselves.
+- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` /
+`IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's
+`getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged.
 - **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module
 MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
 controls.
@@ -339,8 +343,10 @@ dynamic-module ABI.
 #### Responsibilities
-- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
-- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
+- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that
+also caches the resolved ifindex (FR-1.1, FR-1.2).
+- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection
+carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1).
 - Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
 - Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
 - Update EWMAs at flush time (FR-2.4).
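
The "enable `IP_PKTINFO` / `IPV6_RECVPKTINFO`" responsibility in the hunk above amounts to one `setsockopt` call per listening socket. The following is a minimal sketch, assuming Linux; the helper name `enable_ingress_pktinfo` is hypothetical, and the module's real `init_module` code may be organised differently.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <sys/socket.h>

/* Turn on ingress packet-info reporting for one listening socket.
 * Returns 0 on success and -1 on failure, like setsockopt() itself.
 * Hypothetical helper for illustration only. */
int
enable_ingress_pktinfo(int listen_fd, int family)
{
    int on = 1;

    switch (family) {
    case AF_INET:
        /* IPv4: the kernel records the ingress ifindex per connection. */
        return setsockopt(listen_fd, IPPROTO_IP, IP_PKTINFO,
                          &on, sizeof(on));
    case AF_INET6:
        /* IPv6 uses the RFC 3542 "receive" variant of the option. */
        return setsockopt(listen_fd, IPPROTO_IPV6, IPV6_RECVPKTINFO,
                          &on, sizeof(on));
    default:
        /* Unix sockets and other families have no ingress interface. */
        return -1;
    }
}
```

Both options are ordinary socket options that need no extra capability, which is the point NFR-6.1 makes.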
@@ -349,11 +355,14 @@ dynamic-module ABI.
 #### Attribution Model
-The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
-inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator
-writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The kernel binds that listening socket
-with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and
-`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic.
+The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace
+inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host;
+the operator writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The module enables
+`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted
+connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex
+up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means
+outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives
+via a GRE tunnel but the SYN-ACK must leave via the default route.
 The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
 and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
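
The per-(device, family) table that the new Attribution Model text refers to can be pictured as a small array searched by ifindex and address family. The following is a minimal sketch under stated assumptions: the `source_binding` type and `source_tag_for` function are hypothetical names, the ifindex is assumed to have been resolved once at startup, and the fallback tag stands in for the configured default source (`direct` by default, per FR-1.3).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* One entry per `listen ... device=<ifname>` line (hypothetical layout). */
typedef struct {
    unsigned int  ifindex;  /* resolved once from the interface name */
    int           family;   /* AF_INET or AF_INET6 */
    const char   *tag;      /* ipng_source_tag, or the ifname if unset */
} source_binding;

/* Map an ingress ifindex to its source tag. An unknown interface, or an
 * ifindex of 0 (no cmsg available), falls back to default_tag. */
const char *
source_tag_for(const source_binding *table, size_t n,
               unsigned int ifindex, int family, const char *default_tag)
{
    assert(default_tag != NULL);

    if (ifindex == 0) {
        return default_tag;
    }

    for (size_t i = 0; i < n; i++) {
        if (table[i].ifindex == ifindex && table[i].family == family) {
            return table[i].tag;
        }
    }

    return default_tag;
}
```

A linear scan is used here only for brevity; with a handful of tunnels per host it is unlikely to matter, but nothing in the source says how the module actually stores the mapping.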
@@ -482,7 +491,8 @@ fixed-size buffer per chain link and requests new links only when full.
 **Consumes.**
 - **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
-- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind.
+- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection
+`getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd.
 - **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.
 ### The Debian package
@@ -614,8 +624,8 @@ some other endpoint.
 ### Security
-- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during
-bind; workers never call it (NFR-6.1).
+- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the
+listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1).
 - **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
 this is deliberate, and documented (NFR-6.2).
 - **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
@@ -638,9 +648,14 @@ some other endpoint.
 decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
 accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
 heavy for a log-phase handler.
-- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source
-tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
-kernel primitives nginx already knows about.
+- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still
+requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves
+the same problem with plain socket options nginx already knows about.
+- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production:
+`SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening
+socket before `accept()` returns — which there is no hook to override in userspace. On DSR / maglev setups (SYN via a GRE
+tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink,
+so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost.
 - **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
 socket-lookup specificity.
 - **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works

src/ngx_http_ipng_stats_module.c

@@ -8,11 +8,16 @@
 * See docs/design.md in the repository for the full design. The short
 * version is:
 *
-* - Attribution is done by the Linux kernel's TCP socket lookup, via
-* SO_BINDTODEVICE on per-tunnel listening sockets. Each `listen`
-* directive may carry `device=<ifname>` and `ipng_source_tag=<tag>`
-* parameter; this module parses them by replacing the stock
-* ngx_http_core_module `listen` command handler at preconfig time.
+* - Attribution is done by reading the ingress ifindex per TCP
+* connection from the kernel's IP_PKTINFO / IPV6_PKTINFO cmsg
+* (enabled on every HTTP listening socket at init_module time).
+* Listening sockets stay plain wildcards so egress follows the
+* normal routing table — no SO_BINDTODEVICE, so DSR / maglev
+* setups keep working. Each `listen` directive may carry
+* `device=<ifname>` and `ipng_source_tag=<tag>` parameters; this
+* module parses them by replacing the stock ngx_http_core_module
+* `listen` command handler at preconfig time, and maintains the
+* ifindex → source tag lookup table used by the log handler.
 *
 * - Counters are maintained per-worker in a private table (no locks,
 * no atomics on the request path) and flushed into a shared-memory
@@ -1206,7 +1211,7 @@ ngx_http_ipng_stats_init_zone(ngx_shm_zone_t *shm_zone, void *data)
 /* ----------------------------------------------------------------- */
-/* init_module: rebind listen sockets with SO_BINDTODEVICE */
+/* init_module: enable IP_PKTINFO on listen sockets, resolve ifindexes */
 /* ----------------------------------------------------------------- */
 /* init_module: enable IP_PKTINFO / IPV6_RECVPKTINFO on every HTTP