Update docs and header comments for the IP_PKTINFO attribution model

The SO_BINDTODEVICE → IP_PKTINFO switch in the previous commit
was a semantic change: the module no longer touches outgoing
routing at all, and several places in the docs and the module's
top-of-file comment still described the old mechanism.

- README.md and debian/control now describe attribution as
  reading the ingress ifindex per connection from the kernel's
  IP_PKTINFO / IPV6_PKTINFO cmsg, and explicitly call out that
  the DSR / maglev return-path constraint is what makes the
  change necessary.
- docs/design.md FR-1.1 / FR-1.5 / FR-1.6 are rewritten to
  forbid SO_BINDTODEVICE and to describe the cmsg-based lookup.
  NFR-6.1 notes these are ordinary unprivileged socket options.
  The "Components" / "Composes With" sections and the
  "Alternatives Considered" entry are brought in line — and a
  new entry records SO_BINDTODEVICE as a rejected alternative
  with the exact failure mode seen on an IPng production box.
- docs/config-guide.md already carried the new description;
  unchanged here.
- src/ngx_http_ipng_stats_module.c's top-level block comment is
  rewritten to match; the section header above init_module goes
  from "rebind listen sockets with SO_BINDTODEVICE" to "enable
  IP_PKTINFO on listen sockets, resolve ifindexes".

Three SO_BINDTODEVICE mentions deliberately remain in the source
and one in the design doc's alternatives table — all of them
explain that the module *avoids* the option, which is itself
load-bearing documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:16:10 +02:00
parent 31c2ac2d65
commit 055cf9f830
4 changed files with 65 additions and 41 deletions


@@ -62,8 +62,8 @@ cannot provide. But the module is not coupled to that use case — it works with
access logs and existing log-analysis tools.
- The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
restart they start at zero. Consumers who need historical retention SHOULD scrape the endpoint with Prometheus.
- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and
nginx master privileges are the operator's responsibility.
- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges
are the operator's responsibility.
## Requirements
@@ -73,20 +73,24 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
**FR-1 Attribution**
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
plain listening socket as stock nginx does.
- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which records a mapping
between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the
kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module
MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev
deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without
`device=` MUST create a plain listening socket as stock nginx does.
- **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
- **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
`ipng_stats_default_source`, FR-5.3). The built-in default is the literal string `direct`.
- **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
userspace.
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only
connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment
shape: wildcard fallback plus per-tunnel device-bound listeners.
- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist. Since no `SO_BINDTODEVICE`
is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the module distinguishes them by
reading `ifindex` from the per-connection cmsg — so "multiple device-tagged listens at the same port" at the config level
collapses to one kernel socket at runtime without any userspace contortions.
- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST attribute
every connection whose ingress interface is `<ifname>` — regardless of which local address the client addressed — to that
listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3).
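A minimal sketch of how FR-1.3, FR-1.4 and FR-1.6 could combine into a single lookup, assuming a flat table keyed by `(family, ifindex)`. The names `src_map_entry` and `resolve_source` are hypothetical and are not taken from the module source; the real module may organise this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* One entry per `listen ... device=<ifname>` line (hypothetical layout). */
struct src_map_entry {
    int         ifindex;  /* resolved from the device name at init time */
    int         family;   /* AF_INET or AF_INET6 */
    const char *device;   /* value of device= */
    const char *tag;      /* value of ipng_source_tag=, or NULL if absent */
};

/*
 * Map an ingress (family, ifindex) pair to a source tag.
 * - matching entry with a tag    -> that tag            (FR-1.6)
 * - matching entry without a tag -> the interface name  (FR-1.4)
 * - no matching entry            -> the default source  (FR-1.3)
 */
const char *
resolve_source(const struct src_map_entry *map, size_t n,
               int family, int ifindex, const char *default_source)
{
    size_t i;

    assert(map != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (map[i].family == family && map[i].ifindex == ifindex) {
            return map[i].tag != NULL ? map[i].tag : map[i].device;
        }
    }
    return default_source;
}
```

A linear scan is used here only because the number of tagged interfaces is assumed to be small; nothing in the requirements mandates it.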
**FR-2 Counters**
@@ -268,9 +272,9 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat
**NFR-6 Security**
- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made
in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)`
themselves.
- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` /
`IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's
`getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged.
- **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module
MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
controls.
@@ -339,8 +343,10 @@ dynamic-module ABI.
#### Responsibilities
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that
also caches the resolved ifindex (FR-1.1, FR-1.2).
- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection
carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1).
- Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
- Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
- Update EWMAs at flush time (FR-2.4).
@@ -349,11 +355,14 @@ dynamic-module ABI.
#### Attribution Model
The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator
writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The kernel binds that listening socket
with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and
`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic.
The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace
inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host;
the operator writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The module enables
`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted
connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex
up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means
outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives
via a GRE tunnel but the SYN-ACK must leave via the default route.
The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
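The `IP_PKTINFO` readback described under Attribution Model could look roughly like the following for IPv4 on Linux. This is a sketch under stated assumptions, not the module's code: the function names `enable_pktinfo`, `pktinfo_ifindex` and `accepted_ingress_ifindex` are hypothetical, error handling is minimal, and the IPv6 path is omitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes struct in_pktinfo in glibc */
#endif
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Listening side: ask the kernel to record pktinfo for accepted sockets. */
int
enable_pktinfo(int listen_fd)
{
    int on = 1;

    return setsockopt(listen_fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/*
 * Walk a control-message buffer and return the ifindex carried by the
 * first IP_PKTINFO entry, or -1 if the buffer holds none.
 */
int
pktinfo_ifindex(void *buf, size_t len)
{
    struct msghdr   msg;
    struct cmsghdr *c;

    assert(buf != NULL);

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO
            && c->cmsg_len >= CMSG_LEN(sizeof(struct in_pktinfo)))
        {
            struct in_pktinfo pi;

            /* Copy out rather than cast, to avoid alignment assumptions. */
            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return pi.ipi_ifindex;
        }
    }
    return -1;
}

/* Accepted side: read the stashed cmsgs back from the connection fd. */
int
accepted_ingress_ifindex(int conn_fd)
{
    union {
        struct cmsghdr align;       /* forces cmsghdr alignment */
        unsigned char  bytes[256];
    } u;
    socklen_t len = sizeof(u.bytes);

    if (getsockopt(conn_fd, IPPROTO_IP, IP_PKTOPTIONS, u.bytes, &len) != 0) {
        return -1;
    }
    return pktinfo_ifindex(u.bytes, len);
}
```

Under these assumptions a log-phase handler would call `accepted_ingress_ifindex()` on the connection's fd and pass the result to the per-(device, family) table; a return of -1 would be treated the same as an unknown interface.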
@@ -482,7 +491,8 @@ fixed-size buffer per chain link and requests new links only when full.
**Consumes.**
- **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind.
- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection
`getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd.
- **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.
### The Debian package
@@ -614,8 +624,8 @@ some other endpoint.
### Security
- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during
bind; workers never call it (NFR-6.1).
- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the
listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1).
- **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
this is deliberate, and documented (NFR-6.2).
- **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
@@ -638,9 +648,14 @@ some other endpoint.
decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
heavy for a log-phase handler.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source
tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
kernel primitives nginx already knows about.
- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still
requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves
the same problem with plain socket options nginx already knows about.
- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production:
`SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening
socket before `accept()` returns — which there is no hook to override in userspace. On DSR / maglev setups (SYN via a GRE
tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink,
so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost.
- **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
socket-lookup specificity.
- **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works