Update docs and header comments for the IP_PKTINFO attribution model
The SO_BINDTODEVICE → IP_PKTINFO switch in the previous commit was a semantic change: the module no longer touches outgoing routing at all, and several places in the docs and the module's top-of-file comment still described the old mechanism.

- README.md and debian/control now describe attribution as reading the ingress ifindex per connection from the kernel's IP_PKTINFO / IPV6_PKTINFO cmsg, and explicitly call out that the DSR / maglev return-path constraint is what makes the change necessary.
- docs/design.md FR-1.1 / FR-1.5 / FR-1.6 are rewritten to forbid SO_BINDTODEVICE and to describe the cmsg-based lookup. NFR-6.1 notes these are ordinary unprivileged socket options. The "Components" / "Composes With" sections and the "Alternatives Considered" entry are brought in line, and a new entry records SO_BINDTODEVICE as a rejected alternative with the exact failure mode seen on an IPng production box.
- docs/config-guide.md already carried the new description; unchanged here.
- src/ngx_http_ipng_stats_module.c's top-level block comment is rewritten to match; the section header above init_module goes from "rebind listen sockets with SO_BINDTODEVICE" to "enable IP_PKTINFO on listen sockets, resolve ifindexes".

Three SO_BINDTODEVICE mentions deliberately remain in the source and one in the design doc's alternatives table; all of them explain that the module *avoids* the option, which is itself load-bearing documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
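The mechanism the commit message describes can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: it assumes IPv4 only and glibc on Linux, and the function names `enable_pktinfo`, `ifindex_from_cmsgs`, and `ingress_ifindex` are hypothetical. Only the `IP_PKTINFO` option and the `getsockopt(IP_PKTOPTIONS)` readback are taken from the text.

```c
#define _GNU_SOURCE              /* exposes struct in_pktinfo on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: ask the kernel to record ingress packet info on a
 * listening socket. No SO_BINDTODEVICE is involved, so egress routing is
 * left untouched. */
int enable_pktinfo(int listen_fd)
{
    int on = 1;
    return setsockopt(listen_fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Hypothetical helper: walk a control-message buffer and return the
 * ifindex carried in the first IP_PKTINFO cmsg, or -1 if none is found. */
int ifindex_from_cmsgs(void *buf, size_t len)
{
    struct msghdr msg;
    struct cmsghdr *c;

    assert(buf != NULL);
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = len;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO &&
            c->cmsg_len >= CMSG_LEN(sizeof(struct in_pktinfo))) {
            struct in_pktinfo pi;
            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return pi.ipi_ifindex;
        }
    }
    return -1;
}

/* Hypothetical helper: read the stashed options back from an accepted
 * IPv4 TCP socket and extract the ingress ifindex. */
int ingress_ifindex(int conn_fd)
{
    union { char buf[256]; struct cmsghdr align; } u;
    socklen_t len = sizeof(u.buf);

    if (getsockopt(conn_fd, IPPROTO_IP, IP_PKTOPTIONS, u.buf, &len) < 0)
        return -1;
    return ifindex_from_cmsgs(u.buf, (size_t)len);
}
```

In this sketch a caller would run `enable_pktinfo()` once per listening socket and `ingress_ifindex()` on each accepted fd. Splitting out the cmsg walk keeps the parsing testable without a live connection.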
@@ -62,8 +62,8 @@ cannot provide. But the module is not coupled to that use case — it works with
 access logs and existing log-analysis tools.
 - The module does **not** provide persistent storage. Counters live in shared memory for the lifetime of the nginx master process; on
 restart they start at zero. Consumers who need historical retention SHOULD scrape it from Prometheus.
-- The module does **not** own the interfaces, the VIP addresses, or the `SO_BINDTODEVICE` privilege. Interface creation, VIP binding, and
-nginx master privileges are the operator's responsibility.
+- The module does **not** own the interfaces or the VIP addresses. Interface creation, VIP binding, and nginx master privileges
+are the operator's responsibility.

 ## Requirements

@@ -73,20 +73,24 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat

 **FR-1 Attribution**

-- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which causes the resulting
-listening socket to be created with `SO_BINDTODEVICE` set to the named interface. A listen directive without `device=` MUST create a
-plain listening socket as stock nginx does.
+- **FR-1.1** The module MUST support a new parameter on the nginx `listen` directive, `device=<ifname>`, which records a mapping
+between the named interface and the listen's source tag. Attribution at request time MUST read the ingress ifindex from the
+kernel's `IP_PKTINFO` / `IPV6_PKTINFO` cmsg (enabled on every HTTP listening socket) and match against that mapping. The module
+MUST NOT apply `SO_BINDTODEVICE` to the listening socket: the option pins both ingress and egress, which breaks DSR / maglev
+deployments where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. A listen directive without
+`device=` MUST create a plain listening socket as stock nginx does.
 - **FR-1.2** The module MUST support a new parameter on the nginx `listen` directive, `ipng_source_tag=<tag>`, which attaches a short string tag to
 the listening socket. The tag is the dimension the scrape endpoint exports for every counter that came in on that listener.
 - **FR-1.3** A listening socket with neither `device=` nor `ipng_source_tag=` MUST be tagged with the configured default source string (see
 `ipng_stats_default_source`, FR-5.3). The default default is the literal string `direct`.
 - **FR-1.4** A listening socket with `device=X` but no `ipng_source_tag=` MUST be tagged with the interface name `X`.
-- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist, and the kernel's TCP socket lookup
-rules MUST be relied on to dispatch each SYN to the most specific match. The module MUST NOT attempt to duplicate this logic in
-userspace.
-- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST accept only
-connections whose ingress interface is `<ifname>`, for any local address served through that interface. This is the intended deployment
-shape: wildcard fallback plus per-tunnel device-bound listeners.
+- **FR-1.5** Two `listen` directives that share `address:port` but differ in `device=` MUST coexist. Since no `SO_BINDTODEVICE`
+is applied, the kernel delivers all matching SYNs to a single wildcard listening socket and the module distinguishes them by
+reading `ifindex` from the per-connection cmsg — so "multiple device-tagged listens at the same port" at the config level
+collapses to one kernel socket at runtime without any userspace contortions.
+- **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=<ifname>` MUST attribute
+every connection whose ingress interface is `<ifname>` — regardless of which local address the client addressed — to that
+listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3).

 **FR-2 Counters**

@@ -268,9 +272,9 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat

 **NFR-6 Security**

-- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. The `SO_BINDTODEVICE` call is made
-in the nginx master process which is already privileged during the bind step; workers never call `setsockopt(SO_BINDTODEVICE)`
-themselves.
+- **NFR-6.1** The module MUST NOT require any Linux capability beyond what stock nginx already needs. `IP_PKTINFO` /
+`IPV6_RECVPKTINFO` are enabled on the listening sockets (ordinary, unprivileged `setsockopt` calls) and the log handler's
+`getsockopt(IP_PKTOPTIONS)` on the accepted fd is also unprivileged.
 - **NFR-6.2** The scrape endpoint MUST be accessible only via an `allow`/`deny` ACL placed in the operator's nginx config. The module
 MUST NOT ship its own authentication or authorization; it is a plain nginx handler and inherits the enclosing `location` block's access
 controls.
@@ -339,8 +343,10 @@ dynamic-module ABI.

 #### Responsibilities

-- Parse new `listen` parameters `device=` and `ipng_source_tag=` and attach their values to each listening socket's config (FR-1.1, FR-1.2).
-- Call `setsockopt(SO_BINDTODEVICE)` in the master process at bind time for listeners that set `device=` (FR-1.1, NFR-6.1).
+- Parse new `listen` parameters `device=` and `ipng_source_tag=` and maintain a per-`(device, family)` → source-tag mapping that
+also caches the resolved ifindex (FR-1.1, FR-1.2).
+- Enable `IP_PKTINFO` / `IPV6_RECVPKTINFO` on every HTTP listening socket at `init_module` time so each accepted connection
+carries an ingress-ifindex cmsg (FR-1.1, NFR-6.1).
 - Maintain per-worker private counter tables keyed by `(source_id, vip_id, status_class)` (FR-2.1, NFR-1.1).
 - Run a per-worker flush timer that moves deltas into the shared-memory zone atomically (FR-4.2, NFR-1.2).
 - Update EWMAs at flush time (FR-2.4).
@@ -349,11 +355,14 @@ dynamic-module ABI.

 #### Attribution Model

-The module's single novel idea is that per-device attribution is done by the Linux kernel's TCP socket lookup, not by any userspace
-inspection. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host; the operator
-writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The kernel binds that listening socket
-with `SO_BINDTODEVICE`, which causes it to match only connections whose ingress interface is that device. A wildcard `listen 80;` and
-`listen [::]:80;` pair provides the fallback for traffic arriving on any other interface — typically normal web traffic.
+The module's single novel idea is that per-device attribution rides on the kernel's `IP_PKTINFO` cmsg rather than on any userspace
+inspection of packets. Each traffic source that should be tracked separately terminates on a dedicated interface on the nginx host;
+the operator writes one `listen ... device=<ifname> ipng_source_tag=<tag>` line per `(family, interface)` pair. The module enables
+`IP_PKTINFO` / `IPV6_RECVPKTINFO` on the listening socket so the kernel stashes an ingress-ifindex cmsg on every accepted
+connection; at request-log time the module reads that cmsg via `getsockopt(IP_PKTOPTIONS)` on the accepted fd and looks the ifindex
+up in its per-(device, family) table. Listening sockets are left as plain wildcards — no `SO_BINDTODEVICE` — which means
+outgoing packets follow the normal routing table and the plugin composes cleanly with maglev / DSR setups, where the SYN arrives
+via a GRE tunnel but the SYN-ACK must leave via the default route.

 The kernel's TCP listener lookup prefers a more-specific (device-matching) listener over a less-specific (wildcard) one, so the fallback
 and the device-bound listeners coexist without conflicts. The module does not need to duplicate this logic and does not try to.
@@ -482,7 +491,8 @@ fixed-size buffer per chain link and requests new links only when full.
 **Consumes.**

 - **An nginx shared-memory zone** declared by `ipng_stats_zone`. The zone is allocated from nginx's own shared-memory pool.
-- **The Linux `SO_BINDTODEVICE` socket option**, applied in the nginx master process during bind.
+- **Linux `IP_PKTINFO` / `IPV6_RECVPKTINFO` socket options** on the listening sockets, and the per-connection
+`getsockopt(IP_PKTOPTIONS)` cmsg readback on the accepted fd.
 - **The nginx log phase and connection structures** — standard module embedding, no private kernel calls.

 ### The Debian package
@@ -614,8 +624,8 @@ some other endpoint.

 ### Security

-- **Capabilities.** The module needs no capabilities beyond what nginx already has. `SO_BINDTODEVICE` is called by the master during
-bind; workers never call it (NFR-6.1).
+- **Capabilities.** The module needs no capabilities beyond what nginx already has. `IP_PKTINFO` / `IPV6_RECVPKTINFO` on the
+listening socket and `IP_PKTOPTIONS` read-back on the accepted fd are ordinary unprivileged socket options (NFR-6.1).
 - **Scrape access control.** The operator MUST place the scrape `location` behind an `allow`/`deny` ACL. The module does not ship auth;
 this is deliberate, and documented (NFR-6.2).
 - **No PII.** The module records only aggregate per-VIP counters. Client IP, request path, headers, and cookies are not touched.
@@ -638,9 +648,14 @@ some other endpoint.
 decapsulation; the outer and inner conntrack entries are independent and mark does not cross. Even if tagging worked, `SO_MARK` on an
 accepted socket does not reflect incoming packet or conntrack mark without a per-packet `libnetfilter_conntrack` lookup, which is too
 heavy for a log-phase handler.
-- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than `SO_BINDTODEVICE`: it still requires per-source
-tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `SO_BINDTODEVICE` solves the same problem with
-kernel primitives nginx already knows about.
+- **Attribution via multiple GRE tunnels and CONNMARK.** Rejected as strictly worse than the `IP_PKTINFO` approach: it still
+requires per-source tunnels, still needs nginx to read the mark (hard), and adds a netfilter dependency. `IP_PKTINFO` solves
+the same problem with plain socket options nginx already knows about.
+- **Attribution via `SO_BINDTODEVICE` on per-device listening sockets.** Rejected after a v0.4.0 release tested it in production:
+`SO_BINDTODEVICE` also pins *egress* to the bound interface, including the SYN-ACK that the kernel sends from the listening
+socket before `accept()` returns — which there is no hook to override in userspace. On DSR / maglev setups (SYN via a GRE
+tunnel, SYN-ACK expected out the default route) the return packet goes back up the GRE tunnel and gets dropped by the uplink,
+so no connection ever establishes. `IP_PKTINFO` gives us the same ingress identification at zero egress cost.
 - **Attribution via eBPF `SO_REUSEPORT` programs.** Rejected as dramatic overkill for a problem the kernel already solves for free via
 socket-lookup specificity.
 - **Per-VIP enumeration in `listen` directives.** Rejected in favor of wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;`. The wildcard form works
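The tagging rules in FR-1.3, FR-1.4, and the rewritten FR-1.6 amount to a small table lookup keyed by `(ifindex, family)`. The sketch below is one way that lookup could look; the struct layout, the `source_for` name, and the linear scan are assumptions for illustration and are not taken from the module's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>          /* AF_INET, AF_INET6 */

/* Hypothetical table entry: one per (device, family) pair declared with
 * `listen ... device=<ifname>`. */
struct source_map_entry {
    unsigned int ifindex;        /* resolved once, e.g. with if_nametoindex() */
    int          family;         /* AF_INET or AF_INET6 */
    const char  *device;         /* interface name from device= */
    const char  *tag;            /* ipng_source_tag= value, or NULL if absent */
};

/* Return the source tag for a connection that arrived on `ifindex`.
 * An entry without an explicit tag falls back to its device name (FR-1.4);
 * an unmatched interface falls back to the default source (FR-1.3, FR-1.6). */
const char *source_for(const struct source_map_entry *map, size_t n,
                       unsigned int ifindex, int family,
                       const char *default_source)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (map[i].ifindex == ifindex && map[i].family == family)
            return map[i].tag != NULL ? map[i].tag : map[i].device;
    }
    return default_source;
}
```

With a table holding `gre-mg1` tagged `mg1`, a connection whose cmsg reports that interface's ifindex resolves to `mg1`, and anything arriving on an unlisted interface resolves to the default (`direct` unless configured otherwise). A linear scan is used here only because such a table would typically be small; the real module may store it differently.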