PRE-RELEASE v0.7.0
Self-heal device= → ifindex attribution and expose plugin meta counters in the scrape.

ipng_stats_rescan_interval (default 60s, 0 to disable) runs a per-worker timer that re-resolves every binding via if_nametoindex, so interface teardown/recreate (e.g. a GRE tunnel reprovision) picks up the new ifindex without requiring an nginx reload. nginx_ipng_ifindex_misses_total increments whenever a cmsg-reported ingress ifindex doesn't match any binding, making stale mappings observable.

Also expose the existing zone_full_events and flushes_total shared-memory counters, which were tracked but never emitted. JSON output gains a top-level "meta" object; schema stays at 2 (additive change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -81,6 +81,22 @@ is sized so that a scrape interval of 5–15 s sees effectively no lag.
See FR-4.2, FR-5.2.
### `ipng_stats_rescan_interval <duration>`
**Context:** `http`.
**Value:** an nginx duration string (e.g. `30s`, `60s`, `5m`) or `0` to disable.
**Default:** `60s`.
**Minimum:** `1s` (when non-zero).
**Effect:** sets the cadence of a per-worker timer that re-resolves every `device=<ifname>` binding via `if_nametoindex(3)`. This self-heals the attribution table when a configured interface is torn down and recreated (e.g. a GRE tunnel reprovision): it gets a fresh kernel ifindex, which the next rescan picks up. Between the kernel change and the next tick, arriving traffic falls through to the default source and increments `nginx_ipng_ifindex_misses_total`; watch that counter to size this interval. Set to `0` to disable and rely solely on `nginx -s reload` (which always re-runs `if_nametoindex` for every binding in the new cycle).
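
For illustration, one rescan tick reduces to a loop over the binding table. The following is a minimal sketch, not the module's code: the `binding` struct, `rescan_bindings` helper, and the toy `main` resolving `lo` are all hypothetical.

```c
#include <net/if.h>    /* if_nametoindex(3) */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical binding record: one per `device=<ifname>` configuration. */
struct binding {
    const char  *ifname;   /* configured interface name, e.g. "gre-cust1" */
    unsigned int ifindex;  /* kernel ifindex the attribution table keys on */
};

/* One rescan tick: re-resolve every binding by name. An interface that was
 * torn down and recreated keeps its name but gets a fresh kernel ifindex,
 * so the lookup result changes and the table self-heals. A return of 0
 * means the name currently resolves to nothing (interface down or absent);
 * the old index is kept so a later tick can still repair it. */
void rescan_bindings(struct binding *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int idx = if_nametoindex(b[i].ifname);
        if (idx != 0 && idx != b[i].ifindex) {
            b[i].ifindex = idx;  /* pick up the new kernel ifindex */
        }
    }
}

int main(void)
{
    struct binding b[] = { { "lo", 0 } };
    rescan_bindings(b, 1);
    printf("lo -> ifindex %u\n", b[0].ifindex);
    return 0;
}
```

Resolving by name on every tick is what makes the repair automatic: the name is the stable configuration key, and the stored ifindex is only a cache of the kernel's current answer.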
### `ipng_stats_default_source <tag>`
**Context:** `http`.
@@ -276,7 +292,8 @@ per-three-digit-code breakdown should enable `ipng_stats_logtail` and derive it
| `nginx_ipng_zone_bytes_used` | gauge | — | Shared-memory zone bytes currently allocated. |
| `nginx_ipng_zone_bytes_total` | gauge | — | Shared-memory zone capacity in bytes. |
| `nginx_ipng_zone_full_events_total` | counter | — | Number of key insertions dropped because the zone was full. |
| `nginx_ipng_flushes_total` | counter | — | Per-worker flushes into the shared zone, summed across workers. |
| `nginx_ipng_ifindex_misses_total` | counter | — | Connections whose ingress ifindex did not match any configured `device=` binding. |
| `nginx_ipng_flush_duration_seconds` | histogram | `worker` | Histogram of flush durations. |
| `nginx_ipng_scrape_duration_seconds` | histogram | — | Histogram of scrape handler runtimes. |
@@ -287,6 +304,11 @@ See FR-2.*, FR-3.7.
```json
{
  "schema": 2,
  "meta": {
    "ifindex_misses": 0,
    "zone_full_events": 0,
    "flushes_total": 1234
  },
  "records": [
    {
      "source_tag": "mg1",
```
@@ -321,6 +343,7 @@ See FR-3.6.
| --- | --- | --- | --- | --- |
| `ipng_stats_zone` | ✅ | — | — | — |
| `ipng_stats_flush_interval` | ✅ | — | — | — |
| `ipng_stats_rescan_interval` | ✅ | — | — | — |
| `ipng_stats_default_source` | ✅ | — | — | — |
| `ipng_stats_buckets` | ✅ | — | — | — |
| `ipng_stats_byte_buckets` | ✅ | — | — | — |
@@ -434,6 +434,13 @@ values in `listens.conf`, or the interfaces aren't up. Run `ip -br link` and con
`ipng_stats_zone ipng:<size>` (default 4 MB is enough for ~hundreds of VIPs; the code dimension is bucketed to six classes, so one 4 MB zone holds a very large deployment).
**`nginx_ipng_ifindex_misses_total` is climbing.** A connection arrived on an interface whose ifindex isn't in the binding table. Two common causes: (a) a configured interface was torn down and recreated (e.g. a GRE tunnel reprovision) and now has a fresh ifindex; the per-worker rescan timer (`ipng_stats_rescan_interval`, default `60s`) will pick it up on the next tick. (b) Traffic legitimately arrives on an interface that no `device=` binding claims; either add the binding or accept that it lands under the default source. If the counter keeps rising between rescans, shorten `ipng_stats_rescan_interval` or trigger `nginx -s reload` to re-resolve immediately.
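
For reference, a "cmsg-reported ingress ifindex" is the standard `IP_PKTINFO` ancillary datum. The sketch below is a generic illustration of that mechanism on a plain UDP socket, not the module's own plumbing; port `9999` and the `ingress_ifindex` helper are made up for the example.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Walk the control messages of a received datagram and pull out the
 * kernel ifindex it arrived on. Returns 0 if no pktinfo cmsg is present. */
static unsigned int ingress_ifindex(struct msghdr *msg)
{
    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo pi;
            memcpy(&pi, CMSG_DATA(c), sizeof(pi));
            return (unsigned int) pi.ipi_ifindex;
        }
    }
    return 0;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int on = 1;
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(9999) };
    char data[2048], cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };

    /* Ask the kernel to attach pktinfo (including ifindex) to each datagram. */
    setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
    bind(fd, (struct sockaddr *) &sa, sizeof(sa));
    if (recvmsg(fd, &msg, 0) >= 0)   /* blocks until one datagram arrives */
        printf("ingress ifindex: %u\n", ingress_ifindex(&msg));
    return 0;
}
```

An index recovered this way that matches no `device=` binding is what bumps `nginx_ipng_ifindex_misses_total`.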
**`curl http://127.0.0.1:9113/.well-known/ipng/statsz` returns "403 Forbidden".** The `allow`/`deny` ACL is blocking your source address. Either add yourself or scrape from a host already in the allow list.