Switch per-device attribution from SO_BINDTODEVICE to IP_PKTINFO

SO_BINDTODEVICE pins both ingress *and* egress to the bound interface:
the kernel uses the listening socket's device binding when choosing the
output interface for the SYN-ACK, which is sent before accept() returns
and therefore can't be fixed up in userspace. That's fatal for maglev /
DSR deployments where the SYN arrives through a GRE tunnel but the
return path has to leave via the default route; the SYN-ACK goes out the
GRE and is dropped by the uplink, so every new connection times out.

Rework the listen plumbing so the module never touches SO_BINDTODEVICE.
init_module now enables IP_PKTINFO and IPV6_RECVPKTINFO on every HTTP
listening socket and resolves each configured `device=` name to an
ifindex. At request time resolve_source calls getsockopt(IP_PKTOPTIONS)
on the accepted fd to read the per-connection in(6)_pktinfo cmsg the
kernel stashed during the handshake, then matches (ifindex, family)
against the bindings table. The listening sockets remain plain
wildcards, so the return path follows the normal routing table and DSR
works.

The wrapper also no longer clones or rebinds sockets: it still dedups
per (cscf, sockaddr) so multiple device-tagged listens in a single
server block coexist, and dedups bindings on (device, family) so the
same device can carry different tags for v4 and v6 (e.g. tag2-v4 /
tag2-v6) but not pointlessly duplicate when a listen include is shared
across server blocks.

Drive-by fixes to unblock `make pkg-deb` after a prior `make build-asan`:

- debian/rules overrides dh_clean to exclude build/, since nginx-asan's
  install creates nobody:0700 temp dirs dh_clean can't traverse.
- Makefile's build-asan removes those unused runtime temp dirs so the
  tree is clean afterwards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
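The request-time lookup described in the message could look roughly like the following. This is a minimal IPv4-only sketch, not the module's actual resolve_source: the helper names `pktinfo_ifindex` and `accepted_ifindex_v4` are hypothetical, the 256-byte buffer size is an arbitrary assumption, and the IPv6 path is omitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for struct in_pktinfo on glibc */
#endif
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: walk a control-message buffer (as filled in by
 * getsockopt(IP_PKTOPTIONS)) and return the ifindex carried by the first
 * IP_PKTINFO cmsg, or 0 if none is present. */
static unsigned int
pktinfo_ifindex(void *buf, size_t len)
{
    struct msghdr    msg;
    struct cmsghdr  *c;

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP
            && c->cmsg_type == IP_PKTINFO
            && c->cmsg_len >= CMSG_LEN(sizeof(struct in_pktinfo)))
        {
            struct in_pktinfo pi;

            /* Copy out instead of casting to avoid alignment issues. */
            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return (unsigned int) pi.ipi_ifindex;
        }
    }

    return 0;
}

/* Hypothetical helper: query the options stored on an accepted IPv4 TCP
 * socket. Returns 0 on any failure, which a caller can treat as
 * "no binding matched". */
static unsigned int
accepted_ifindex_v4(int fd)
{
    union {
        char            buf[256];   /* assumed large enough for one pktinfo */
        struct cmsghdr  align;      /* forces cmsghdr alignment */
    } u;
    socklen_t len = sizeof(u.buf);

    if (getsockopt(fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) != 0) {
        return 0;
    }

    return pktinfo_ifindex(u.buf, (size_t) len);
}
```

A caller would pass the fd returned by accept() on a listener that already had IP_PKTINFO enabled, then look the result up in the bindings table together with the address family.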
@@ -18,12 +18,13 @@ is invoked, so they compose with every standard `listen` parameter (`ssl`, `http
 
 **Default:** not set (plain listen).
 
-**Effect:** the resulting listening socket has `SO_BINDTODEVICE` applied at init-module time, making the kernel accept only connections
-whose ingress interface is `<ifname>`. Combined with a wildcard listen address (`80`, `[::]:80`) this is the mechanism by which the
-plugin attributes traffic to a specific ingress interface.
+**Effect:** records a binding between `<ifname>` and the listen's source tag. At request time the log handler reads the ingress
+ifindex for the connection (via `IP_PKTINFO` / `IPV6_PKTINFO` cmsg that the module enables on every HTTP listening socket at
+init-module time) and attributes the request to whichever binding matches. The listening socket itself is a plain wildcard — no
+`SO_BINDTODEVICE`, no extra sockets — which keeps outgoing packets on the default routing table and makes DSR / maglev
+deployments work.
 
-The `setsockopt(SO_BINDTODEVICE)` call runs in the nginx master process while it still holds its initial privileges — workers never
-call it, and no additional Linux capability is required beyond what stock nginx already has (NFR-6.1).
+No additional Linux capability is required beyond what stock nginx already has (NFR-6.1).
 
 See FR-1.1, FR-1.5, FR-1.6.
 
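The binding lookup that the updated **Effect** paragraph describes, deduplicated on (device, family) and matched on (ifindex, family), might be sketched as below. Everything here is an assumption for illustration: the struct layout, the `bindings_add` / `bindings_match` names, the fixed-size table, and passing the ifindex in directly (the module resolves it from the `device=` name at init time).

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

#define MAX_BINDINGS 16   /* arbitrary capacity for the sketch */

/* Hypothetical binding record: one per (device, family) pair. */
struct binding {
    char          device[16];   /* interface name, NUL-terminated */
    unsigned int  ifindex;      /* resolved from the device name */
    int           family;       /* AF_INET or AF_INET6 */
    const char   *tag;          /* source tag of the listen */
};

struct bindings {
    struct binding  items[MAX_BINDINGS];
    size_t          n;
};

/* Add a binding unless (device, family) is already present.
 * Returns 1 if added, 0 if it was a duplicate, -1 if it does not fit. */
static int
bindings_add(struct bindings *t, const char *device, unsigned int ifindex,
    int family, const char *tag)
{
    size_t i;

    if (strlen(device) >= sizeof(t->items[0].device)) {
        return -1;
    }

    for (i = 0; i < t->n; i++) {
        if (t->items[i].family == family
            && strcmp(t->items[i].device, device) == 0)
        {
            return 0;   /* e.g. a listen include shared across servers */
        }
    }

    if (t->n == MAX_BINDINGS) {
        return -1;
    }

    strcpy(t->items[t->n].device, device);
    t->items[t->n].ifindex = ifindex;
    t->items[t->n].family = family;
    t->items[t->n].tag = tag;
    t->n++;

    return 1;
}

/* Return the tag bound to (ifindex, family), or NULL if nothing matches. */
static const char *
bindings_match(const struct bindings *t, unsigned int ifindex, int family)
{
    size_t i;

    for (i = 0; i < t->n; i++) {
        if (t->items[i].ifindex == ifindex && t->items[i].family == family) {
            return t->items[i].tag;
        }
    }

    return NULL;
}
```

Keying the duplicate check on (device, family) rather than on the device alone is what lets one interface carry separate v4 and v6 tags, as in the tag2-v4 / tag2-v6 example from the commit message.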