Switch per-device attribution from SO_BINDTODEVICE to IP_PKTINFO

SO_BINDTODEVICE pins both ingress *and* egress to the bound
interface — the kernel uses the listening socket's device
binding when choosing the output interface for the SYN-ACK,
which is sent before accept() returns and therefore can't be
fixed up in userspace. That's fatal for maglev / DSR
deployments where the SYN arrives through a GRE tunnel but the
return path has to leave via the default route; the SYN-ACK
goes out the GRE tunnel instead and is dropped by the uplink,
so every new connection times out.

Rework the listen plumbing so the module never touches
SO_BINDTODEVICE. init_module now enables IP_PKTINFO and
IPV6_RECVPKTINFO on every HTTP listening socket and resolves
each configured `device=` name to an ifindex. At request time
resolve_source calls getsockopt(IP_PKTOPTIONS) on the accepted
fd to read the per-connection in(6)_pktinfo cmsg the kernel
stashed during the handshake, then matches (ifindex, family)
against the bindings table. The listening sockets remain plain
wildcards, so the return path follows the normal routing table
and DSR works.
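
As a rough illustration of the read path described above (not
the module's actual code), the IPv4 side could look like the
sketch below. The helper names enable_pktinfo, pktinfo_ifindex
and accepted_ifindex_v4 are hypothetical; only the socket
options and the cmsg layout come from Linux. The IPv6 side is
omitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* makes glibc expose struct in_pktinfo */
#endif
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to record pktinfo on a listening fd. */
int enable_pktinfo(int listen_fd, int family)
{
    int on = 1;

    if (family == AF_INET6)
        return setsockopt(listen_fd, IPPROTO_IPV6, IPV6_RECVPKTINFO,
                          &on, sizeof(on));
    return setsockopt(listen_fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Walk a control buffer and pull the ingress ifindex out of the first
 * IP_PKTINFO cmsg. Returns 0 on success, -1 if no such cmsg is present. */
int pktinfo_ifindex(void *buf, size_t len, int *ifindex)
{
    struct msghdr msg;
    struct cmsghdr *c;

    assert(ifindex != NULL);
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pi;

            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            *ifindex = pi.ipi_ifindex;
            return 0;
        }
    }
    return -1;
}

/* IPv4 only: read the cmsg the kernel stashed for an accepted TCP fd. */
int accepted_ifindex_v4(int fd, int *ifindex)
{
    /* The union keeps the buffer aligned for struct cmsghdr. */
    union { unsigned char buf[256]; struct cmsghdr align; } u;
    socklen_t len = sizeof(u.buf);

    if (getsockopt(fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) < 0)
        return -1;
    return pktinfo_ifindex(u.buf, (size_t) len, ifindex);
}
```

Keeping the cmsg walk separate from the getsockopt call means
the parser can be exercised on a hand-built buffer without any
live sockets.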

The wrapper also no longer clones or rebinds sockets. It still
dedups listens per (cscf, sockaddr), so multiple device-tagged
listens in a single server block coexist. Bindings are deduped
on (device, family), so the same device can carry different
tags for v4 and v6 (e.g. tag2-v4 / tag2-v6) without being
duplicated when a listen include is shared across server
blocks.
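
The (device, family) dedup and the (ifindex, family) match can
be sketched as a small table. This is an illustration under
assumptions, not the module's data structure: struct binding,
bindings_add and bindings_match are hypothetical names, and the
ifindex is passed in directly rather than resolved from the
device name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>        /* AF_INET, AF_INET6 */

#define MAX_BINDINGS 32

/* Hypothetical binding record: one per (device, family) pair. */
struct binding {
    char device[16];           /* interface name, IFNAMSIZ-sized */
    int  family;               /* AF_INET or AF_INET6 */
    int  ifindex;              /* resolved once at init time */
    char tag[32];              /* source tag to attribute requests to */
};

struct bindings {
    struct binding items[MAX_BINDINGS];
    size_t n;
};

/* Add a binding, deduplicating on (device, family).
 * Returns 1 if added, 0 if already present, -1 if it does not fit. */
int bindings_add(struct bindings *t, const char *device, int family,
                 int ifindex, const char *tag)
{
    size_t i;
    struct binding *b;

    assert(t != NULL);
    for (i = 0; i < t->n; i++) {
        if (t->items[i].family == family
            && strcmp(t->items[i].device, device) == 0)
            return 0;          /* same listen include seen again */
    }
    if (t->n == MAX_BINDINGS
        || strlen(device) >= sizeof(t->items[0].device)
        || strlen(tag) >= sizeof(t->items[0].tag))
        return -1;

    b = &t->items[t->n++];
    snprintf(b->device, sizeof(b->device), "%s", device);
    snprintf(b->tag, sizeof(b->tag), "%s", tag);
    b->family = family;
    b->ifindex = ifindex;
    return 1;
}

/* Request-time lookup: match (ifindex, family), return the tag or NULL. */
const char *bindings_match(const struct bindings *t, int ifindex, int family)
{
    size_t i;

    for (i = 0; i < t->n; i++) {
        if (t->items[i].ifindex == ifindex && t->items[i].family == family)
            return t->items[i].tag;
    }
    return NULL;
}
```

With this shape, adding ("gre-a", AF_INET) and ("gre-a",
AF_INET6) yields two entries, while repeating either pair is a
no-op.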

Drive-by fixes to unblock `make pkg-deb` after a prior
`make build-asan`:
- debian/rules overrides dh_clean to exclude build/, since
  nginx-asan's install creates nobody:0700 temp dirs dh_clean
  can't traverse.
- Makefile's build-asan removes those unused runtime temp dirs
  so the tree is clean afterwards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:39:01 +02:00
parent cf7a538ee6
commit 450391af6b
6 changed files with 202 additions and 328 deletions

@@ -18,12 +18,13 @@ is invoked, so they compose with every standard `listen` parameter (`ssl`, `http
**Default:** not set (plain listen).
**Effect:** the resulting listening socket has `SO_BINDTODEVICE` applied at init-module time, making the kernel accept only connections
whose ingress interface is `<ifname>`. Combined with a wildcard listen address (`80`, `[::]:80`) this is the mechanism by which the
plugin attributes traffic to a specific ingress interface.
**Effect:** records a binding between `<ifname>` and the listen's source tag. At request time the log handler reads the ingress
ifindex for the connection (via `IP_PKTINFO` / `IPV6_PKTINFO` cmsg that the module enables on every HTTP listening socket at
init-module time) and attributes the request to whichever binding matches. The listening socket itself is a plain wildcard — no
`SO_BINDTODEVICE`, no extra sockets — which keeps outgoing packets on the default routing table and makes DSR / maglev
deployments work.
The `setsockopt(SO_BINDTODEVICE)` call runs in the nginx master process while it still holds its initial privileges — workers never
call it, and no additional Linux capability is required beyond what stock nginx already has (NFR-6.1).
No additional Linux capability is required beyond what stock nginx already has (NFR-6.1).
See FR-1.1, FR-1.5, FR-1.6.

@@ -42,8 +42,11 @@ nginx -V 2>&1 | grep -o ngx_http_ipng_stats_module
## 2. Set up interfaces for per-device attribution
The plugin attributes traffic by watching which interface the request came in on, using `SO_BINDTODEVICE` on per-interface listening
sockets. For this to work, each traffic source that should be tracked separately MUST arrive on its own interface.
The plugin attributes traffic by watching which interface the request came in on. It enables `IP_PKTINFO` / `IPV6_RECVPKTINFO` on
each listening socket and reads the ingress `ifindex` per accepted connection, so the listening sockets remain plain wildcards and
outgoing packets follow the normal routing table — this is what makes DSR / maglev deployments work, where the SYN arrives via a
GRE tunnel but the SYN-ACK must leave via the default route. For the attribution itself to work, each traffic source that should be
tracked separately MUST arrive on its own interface.
This works with any kind of Linux interface — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet. This guide uses GRE
tunnels as the example, but the module does not care about the interface type.
@@ -175,11 +178,10 @@ VIP is a `server_name` change; adding a new interface is an append to `listens.c
### All listens on a shared port must be device-tagged
If you use multiple `listen` directives on the same port (e.g. port 80), **every one of them must carry `device=<ifname>`**. Mixing a
device-pinned listen with a plain `listen 80;` or with an address-specific `listen 192.0.2.1:80;` on the same port is **not
supported** and nginx will fail to start. This is a kernel-level limitation: a device-pinned socket sets `SO_BINDTODEVICE` before
`bind(2)`, while a plain wildcard socket sets no device filter — Linux refuses to hold both on the same `(addr, port)` tuple, so
the second bind fails with `EADDRINUSE` regardless of what the nginx config-level dedup might do.
If you use multiple `listen` directives on the same port (e.g. port 80), **every one of them must carry `device=<ifname>`**. Mixing
a device-tagged listen with a plain `listen 80;` or with an address-specific `listen 192.0.2.1:80;` on the same port is **not
supported**: nginx's config-level dedup rejects same-sockaddr listens within a server block, and the module's wrapper only
exempts directives that carry `device=`.
For "direct" traffic — clients hitting the host on a non-attributed interface — use a **separate port** on the direct interface
(e.g. `listen 198.51.100.1:8081;`). That listen then has no `device=`, so it falls back to the tag set by