From 7ed77f5b22fb252dee968e1716a806f5008cd74e Mon Sep 17 00:00:00 2001 From: Pim van Pelt Date: Sun, 19 Apr 2026 16:23:58 +0200 Subject: [PATCH] Strip socket options on cross-cscf repeat listens (v0.7.2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make the shared-listen-include pattern work with `reuseport` and the other socket-level listen options. Nginx core enforces at-most-once per sockaddr on options that set lsopt.set=1 (reuseport, bind, backlog=, rcvbuf=, sndbuf=, setfib=, fastopen=, accept_filter=, deferred, ipv6only=, so_keepalive=) and emits "duplicate listen options for " otherwise. That rule collides with a single listens.conf included from every vhost — each vhost's include re-submits the same options. The listen wrapper now detects the cross-cscf case, strips those options from cf->args before delegating to the core handler, and logs one notice per stripped listen. The first cscf owns the options on the kernel socket; later cscfs merge cleanly via ngx_http_add_server. Protocol-level flags (ssl, http2, quic, proxy_protocol) pass through untouched since nginx OR-merges those across cscfs. This unblocks `reuseport` for deployments that want better new-connection spread across workers. Co-Authored-By: Claude Opus 4.7 (1M context) --- Makefile | 2 +- debian/changelog | 20 +++++ docs/design.md | 8 ++ docs/user-guide.md | 25 ++++++ src/ngx_http_ipng_stats_module.c | 85 ++++++++++++++++++++- tests/01-module/01-e2e.robot | 22 ++++-- tests/01-module/lab/server/ipng-listens.inc | 15 +++- tests/01-module/lab/server/nginx.conf | 9 ++- 8 files changed, 168 insertions(+), 18 deletions(-) diff --git a/Makefile b/Makefile index f497c67..8848eb3 100644 --- a/Makefile +++ b/Makefile @@ -23,7 +23,7 @@ BUILD_DIR := $(CURDIR)/build # the package version from there directly. The C code picks up VERSION # via the generated src/version.h (written by the version-header target # below and depended on by the module build). -VERSION := 0.7.1 +VERSION := 0.7.2 NGINX_SRC ?= diff --git a/debian/changelog b/debian/changelog index 2fe8c9e..e5db033 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,3 +1,23 @@ +nginx-ipng-stats-plugin (0.7.2-1) unstable; urgency=medium + + * Pre-release v0.7.2. + - Listen wrapper now strips socket-level options (reuseport, + bind, backlog=, rcvbuf=, sndbuf=, setfib=, fastopen=, + accept_filter=, deferred, ipv6only=, so_keepalive=) from + cf->args when a sockaddr recurs under a different server + block. Previously this pattern — typical with a shared + `include listens.conf;` across vhosts — tripped nginx's + "duplicate listen options for " check because those + options are one-shot per kernel socket. The first cscf now + owns the options on the shared kernel socket and later + cscfs merge cleanly via ngx_http_add_server. A NOTICE is + logged each time socket options are stripped. This makes + `reuseport` usable with the shared-include deployment + pattern from docs/user-guide.md, which helps worker + load-balancing on busy hosts. + + -- Pim van Pelt Sun, 19 Apr 2026 16:30:00 +0200 + nginx-ipng-stats-plugin (0.7.1-1) unstable; urgency=medium * Pre-release v0.7.1. diff --git a/docs/design.md b/docs/design.md index ba75d0b..ccf1e9d 100644 --- a/docs/design.md +++ b/docs/design.md @@ -90,6 +90,14 @@ Each requirement carries a unique identifier (`FR-X.Y` or `NFR-X.Y`) so that lat `(server block, sockaddr)` across both plain and device-tagged listens: the first occurrence registers the kernel socket, and subsequent same-sockaddr siblings (plain or device-tagged) are accepted without tripping nginx's duplicate-listen check. Device-tagged siblings additionally register an entry in the attribution table. +- **FR-1.5a** When the *same* sockaddr is listen'd from a *different* `server { ... }` block — the shared `include` pattern — + the wrapper MUST strip socket-level options from `cf->args` before delegating to the core listen handler. These options + (`reuseport`, `bind`, `backlog=`, `rcvbuf=`, `sndbuf=`, `setfib=`, `fastopen=`, `accept_filter=`, `deferred`, `ipv6only=`, + `so_keepalive=`) apply to the one kernel socket that backs the sockaddr and nginx rejects any attempt to set them more + than once per sockaddr with `duplicate listen options for ` (see `ngx_http_add_addresses` in `src/http/ngx_http.c`). + The first cscf to hit the sockaddr owns the options; subsequent cscfs pass the core handler with the options removed and + merge via `ngx_http_add_server`. Protocol-level flags (`ssl`, `http2`, `quic`, `proxy_protocol`) are preserved on every call + because nginx merges them with OR semantics across cscfs. - **FR-1.6** A `listen` directive that uses a wildcard address (`80`, `[::]:80`) together with `device=` MUST attribute every connection whose ingress interface is `` — regardless of which local address the client addressed — to that listen's source tag. Traffic on other interfaces MUST fall back to the configured default source (see FR-1.3). diff --git a/docs/user-guide.md b/docs/user-guide.md index d2ceb55..937ecc5 100644 --- a/docs/user-guide.md +++ b/docs/user-guide.md @@ -199,6 +199,31 @@ register bindings without tripping nginx's duplicate-listen check. Traffic arriv back to `ipng_stats_default_source` (`direct` by default). Keeping "direct" traffic on its own port — e.g. `listen 198.51.100.1:8081;` — remains a fine pattern when you want a hard split, but it's no longer required. +### Shared includes with `reuseport` (or other socket-level options) + +Socket-level `listen` options — `reuseport`, `bind`, `backlog=`, `rcvbuf=`, `sndbuf=`, `setfib=`, `fastopen=`, `accept_filter=`, +`deferred`, `ipv6only=`, `so_keepalive=` — belong to the one kernel socket that backs a given sockaddr, not to a particular +`server { ... }` block. Stock nginx enforces this by accepting them on at most the *first* listen per sockaddr and emitting +`duplicate listen options for ` on any subsequent repeat. That rule collides with the common deployment pattern of a single +`listens.conf` included from every vhost, because each vhost's `include` re-submits the same options. + +The wrapper resolves this transparently. When a sockaddr recurs under a different `server` block than the one that first +registered it, the wrapper strips socket-level options from the incoming `cf->args` before delegating to nginx's core listen +handler. The first `server` block owns the options on the kernel socket (including `reuseport`, which triggers per-worker +socket cloning); later blocks merge cleanly via `ngx_http_add_server` and inherit the same socket. The wrapper logs one +`[notice] ipng_stats: stripped socket options from duplicate listen on ` per stripped listen — informational, not an +error. So this include works unchanged across as many vhosts as you like: + +```nginx +listen 443 ssl reuseport device=gre-mg1 ipng_source_tag=mg1; +listen [::]:443 ssl reuseport device=gre-mg1 ipng_source_tag=mg1; +``` + +`reuseport` noticeably helps worker load-balancing on busy hosts: without it, a single shared listening socket forces workers +to compete for accepts and traffic routinely concentrates on one or two workers. HTTP/2 and long-lived keepalive connections +can still skew CPU toward whichever worker holds a few heavy clients — `reuseport` does not reshuffle existing connections — +but new-connection distribution across workers becomes kernel-hashed, not first-ready-wins. + ## 4. Verify with curl Generate some traffic (or wait for real traffic), then scrape the endpoint locally: diff --git a/src/ngx_http_ipng_stats_module.c b/src/ngx_http_ipng_stats_module.c index 751659d..c575b84 100644 --- a/src/ngx_http_ipng_stats_module.c +++ b/src/ngx_http_ipng_stats_module.c @@ -499,6 +499,50 @@ ngx_http_ipng_stats_preconfig(ngx_conf_t *cf) } +/* Return 1 if `a` is a listen parameter that makes nginx set + * `lsopt.set = 1` (i.e. a socket-level option that nginx enforces + * at-most-once per sockaddr via its "duplicate listen options" check + * in ngx_http_add_addresses). Keep in sync with nginx core's + * ngx_http_core_listen — currently: bind, setfib=, fastopen=, + * backlog=, rcvbuf=, sndbuf=, accept_filter=, deferred, ipv6only=, + * reuseport, so_keepalive=. */ +static ngx_uint_t +ngx_http_ipng_stats_arg_is_set_triggering(ngx_str_t *a) +{ + static const struct { const char *s; size_t n; ngx_uint_t exact; } tbl[] = { + { "bind", 4, 1 }, + { "deferred", 8, 1 }, + { "reuseport", 9, 1 }, + { "setfib=", 7, 0 }, + { "fastopen=", 9, 0 }, + { "backlog=", 8, 0 }, + { "rcvbuf=", 7, 0 }, + { "sndbuf=", 7, 0 }, + { "accept_filter=",14, 0 }, + { "ipv6only=", 9, 0 }, + { "so_keepalive=", 13, 0 }, + }; + ngx_uint_t k; + + for (k = 0; k < sizeof(tbl) / sizeof(tbl[0]); k++) { + if (tbl[k].exact) { + if (a->len == tbl[k].n + && ngx_strncmp(a->data, tbl[k].s, tbl[k].n) == 0) + { + return 1; + } + } else { + if (a->len > tbl[k].n + && ngx_strncmp(a->data, tbl[k].s, tbl[k].n) == 0) + { + return 1; + } + } + } + return 0; +} + + /* The wrapper extracts device= and ipng_source_tag= from cf->args, compacting * the array in place, then calls the original ngx_http_core_module * listen handler. After a successful call it records a binding in @@ -580,24 +624,59 @@ ngx_http_ipng_stats_listen_wrapper(ngx_conf_t *cf, ngx_command_t *cmd, * nginx's handler runs once (creating the socket), subsequent * same-sockaddr listens just register additional device * bindings. Across different server blocks the same sockaddr - * re-appears and nginx merges the cscf via ngx_http_add_server. */ + * re-appears and nginx merges the cscf via ngx_http_add_server. + * + * The `sockaddr_seen_elsewhere` flag catches the + * shared-listen-include pattern where the SAME sockaddr appears + * under multiple server blocks: nginx rejects a second call + * carrying socket-level options (reuseport, bind, backlog=, ...) + * with "duplicate listen options". Those options belong to the + * single kernel socket that backs the sockaddr — the first cscf + * has already applied them; stripping them off subsequent + * same-sockaddr listens lets the include stay symmetrical. */ void *cscf = ngx_http_conf_get_module_srv_conf(cf, ngx_http_core_module); ngx_http_ipng_stats_seen_t *seen = imcf->listens_seen->elts; ngx_uint_t same_cscf_sockaddr = 0; + ngx_uint_t sockaddr_seen_elsewhere = 0; for (i = 0; i < imcf->listens_seen->nelts; i++) { - if (seen[i].cscf != cscf) continue; if (seen[i].socklen != u.addrs[0].socklen) continue; if (ngx_cmp_sockaddr((struct sockaddr *) &seen[i].sockaddr, seen[i].socklen, u.addrs[0].sockaddr, - u.addrs[0].socklen, 1) == NGX_OK) + u.addrs[0].socklen, 1) != NGX_OK) { + continue; + } + if (seen[i].cscf == cscf) { same_cscf_sockaddr = 1; break; } + sockaddr_seen_elsewhere = 1; } if (!same_cscf_sockaddr) { + if (sockaddr_seen_elsewhere) { + ngx_uint_t stripped = 0; + i = 1; + while (i < cf->args->nelts) { + if (ngx_http_ipng_stats_arg_is_set_triggering(&value[i])) { + for (j = i; j + 1 < cf->args->nelts; j++) { + value[j] = value[j + 1]; + } + cf->args->nelts--; + stripped = 1; + continue; + } + i++; + } + if (stripped) { + ngx_conf_log_error(NGX_LOG_NOTICE, cf, 0, + "ipng_stats: stripped socket options from " + "duplicate listen on %V — already applied by " + "an earlier server block", &value[1]); + } + } + rv = ngx_http_core_listen_orig(cf, cmd, conf); if (rv != NGX_CONF_OK) { return rv; diff --git a/tests/01-module/01-e2e.robot b/tests/01-module/01-e2e.robot index b0a91db..e8da7c0 100644 --- a/tests/01-module/01-e2e.robot +++ b/tests/01-module/01-e2e.robot @@ -31,14 +31,16 @@ Module loads Shared-listen-include across multiple server blocks [Documentation] Three server blocks all pull in the same - ... ipng-listens.inc (see docs/user-guide.md). nginx - ... must start without "conflicting server name" or - ... "duplicate listen options" warnings, and the - ... module must end up with exactly one listening - ... socket per address family on port 8080 (one for - ... v4 wildcard, one for v6) — not one per (server - ... block × device × family), which would exhaust - ... the fd table on a real host. + ... ipng-listens.inc (see docs/user-guide.md). The + ... include also carries `reuseport` on every listen + ... — nginx core would normally reject the second + ... server block with "duplicate listen options", but + ... the wrapper strips socket-level options on a + ... repeat (cross-cscf) sockaddr so the first cscf + ... owns the reuseport-cloned socket and the rest + ... merge cleanly. With worker_processes unset + ... (default 1), reuseport produces one socket per + ... (worker × family), i.e. 2 on :8080 here. ${output} = Docker Exec ${SERVER} nginx -t 2>&1 Should Not Contain ${output} conflicting server name Should Not Contain ${output} duplicate listen @@ -46,6 +48,10 @@ Shared-listen-include across multiple server blocks ${count} = Get Regexp Matches ${listens} :8080\\s Length Should Be ${count} 2 ... Expected 2 listening sockets on port 8080 (v4+v6 wildcards); got ${count} + # Proves the cross-cscf option-stripping path actually fired for + # the 2nd and 3rd server blocks. `nginx -t` replays the whole + # config and emits the wrapper's NOTICE each time it strips. + Should Contain ${output} stripped socket options from duplicate listen Prometheus scrape [Documentation] Scrape returns HELP/TYPE preamble. diff --git a/tests/01-module/lab/server/ipng-listens.inc b/tests/01-module/lab/server/ipng-listens.inc index ce657df..754d007 100644 --- a/tests/01-module/lab/server/ipng-listens.inc +++ b/tests/01-module/lab/server/ipng-listens.inc @@ -5,8 +5,15 @@ # file from multiple server blocks to exercise the wrapper's dedup # logic: a naive implementation would either error with "duplicate # listen options" or create N * (devices × families) sockets. +# +# `reuseport` is present on every listen to exercise the wrapper's +# cross-cscf option-stripping path: nginx itself would reject the +# second server block's include with "duplicate listen options" if +# reuseport weren't stripped from the repeat calls. The first cscf's +# listen binds the reuseport-cloned kernel socket; subsequent ones +# merge cleanly into it. -listen 8080 device=eth1 ipng_source_tag=tag1; -listen [::]:8080 device=eth1 ipng_source_tag=tag1; -listen 8080 device=eth2 ipng_source_tag=tag2-v4; -listen [::]:8080 device=eth2 ipng_source_tag=tag2-v6; +listen 8080 reuseport device=eth1 ipng_source_tag=tag1; +listen [::]:8080 reuseport device=eth1 ipng_source_tag=tag1; +listen 8080 reuseport device=eth2 ipng_source_tag=tag2-v4; +listen [::]:8080 reuseport device=eth2 ipng_source_tag=tag2-v6; diff --git a/tests/01-module/lab/server/nginx.conf b/tests/01-module/lab/server/nginx.conf index 2ff5cff..fed3d32 100644 --- a/tests/01-module/lab/server/nginx.conf +++ b/tests/01-module/lab/server/nginx.conf @@ -48,9 +48,14 @@ http { # pair, so each server block gets its own cscf attached but no # server block triggers nginx's "duplicate listen options" # check; + # * strip socket-level options (reuseport, bind, backlog=, ...) + # from cross-cscf repeat sockaddrs — nginx enforces these + # at-most-once per sockaddr, and the first cscf already owns + # the single kernel socket that the remaining cscfs merge + # into; # * dedup bindings globally on (sockaddr, device), so init_module - # creates exactly four sockets here (two families × two - # devices) rather than 3 × 4 = 12. + # creates exactly one binding per (device, family) rather than + # one per (server block × device × family). # The default server owns the locations used by the traffic tests; # the two extras exist only to exercise the shared-include pattern. server {