<!-- SPDX-License-Identifier: Apache-2.0 -->

# nginx-ipng-stats-plugin — User Guide

This document walks an operator through installing the plugin, deploying it on a single nginx host serving traffic that arrives on distinct interfaces (GRE tunnels, VLANs, bonded links, or plain ethernet), verifying that counters are flowing, and hooking up the scrape endpoint to Prometheus and other consumers.

It covers (NFR-7.1):

1. Installing the Debian package.
2. Setting up interfaces for per-device attribution (GRE tunnel example).
3. Writing a minimal nginx configuration.
4. Verifying with `curl`.
5. Scraping from Prometheus.
6. Setting up a global logtail access log.
7. Integrating with scrape consumers.

For a directive-by-directive reference, read [`config-guide.md`](config-guide.md) alongside this guide.

## 1. Install the package

On Debian Trixie (and newer), the module is distributed as `libnginx-mod-http-ipng-stats`. The package depends on the stock `nginx` package and loads cleanly into it without recompiling nginx itself.

```
sudo apt install ./libnginx-mod-http-ipng-stats_*_amd64.deb
```

The package will:

- Drop `ngx_http_ipng_stats_module.so` into `/usr/lib/nginx/modules/`.
- Place a `load_module` stanza in `/etc/nginx/modules-available/50-mod-http-ipng-stats.conf`.
- Symlink it into `/etc/nginx/modules-enabled/` so nginx picks it up on the next reload.
- Run `nginx -t` and, if the test fails, remove the `modules-enabled` symlink and print a warning — so a broken upgrade never leaves you with an nginx that cannot start.

Confirm the module is loaded. A dynamically loaded module does not appear in `nginx -V`, which only lists compile-time configure arguments, so grep the full configuration dump instead:

```
nginx -T 2>&1 | grep ipng_stats
```
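
If the grep comes back empty, verify the artifacts from the list above landed where this guide expects them (paths as stated by the package description):

```
ls -l /usr/lib/nginx/modules/ngx_http_ipng_stats_module.so \
      /etc/nginx/modules-enabled/50-mod-http-ipng-stats.conf
```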

## 2. Set up interfaces for per-device attribution

The plugin attributes traffic by watching which interface the request came in on. It enables `IP_PKTINFO` / `IPV6_RECVPKTINFO` on each listening socket and reads the ingress `ifindex` per accepted connection, so the listening sockets remain plain wildcards and outgoing packets follow the normal routing table — this is what makes DSR / maglev deployments work, where the SYN arrives via a GRE tunnel but the SYN-ACK must leave via the default route. For the attribution itself to work, each traffic source that should be tracked separately MUST arrive on its own interface.

This works with any kind of Linux interface — GRE tunnels, VLANs, VXLANs, bonded links, or plain ethernet. This guide uses GRE tunnels as the example, but the module does not care about the interface type.

This guide doesn't prescribe a specific networking layer — use whatever your host already uses (`systemd-networkd`, Netplan, `/etc/network/interfaces`, or a hand-rolled script). The only hard requirements are:

- Each traffic source that should be separately attributed gets its own interface on the nginx host.
- Interfaces follow a consistent naming pattern. For GRE tunnels we recommend `gre-<tag>`, e.g. `gre-mg1`, `gre-mg2`.
- The VIPs are bound to a local dummy or loopback interface so the kernel accepts packets destined for them.

For example, with `systemd-networkd`, a GRE tunnel to a remote peer at `2001:db8::1` from this host at `2001:db8::100` looks like:

```
# /etc/systemd/network/10-gre-mg1.netdev
[NetDev]
Name=gre-mg1
Kind=ip6gre

[Tunnel]
Local=2001:db8::100
Remote=2001:db8::1
TTL=64
```

```
# /etc/systemd/network/10-gre-mg1.network
[Match]
Name=gre-mg1

[Network]
LinkLocalAddressing=no
```

Repeat for each additional tunnel (a scripted sketch follows below). A trimmed-down variant of this scheme is what IPng uses in production.
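
For more than a handful of peers, generating the unit files from a peer list keeps the naming consistent. A minimal sketch, assuming the `gre-<tag>` naming above; the peer map and the `Local=` address are illustrative placeholders for your own topology:

```bash
#!/bin/bash
# Run as root. Hypothetical peer map: tag -> remote tunnel endpoint.
declare -A peers=( [mg1]=2001:db8::1 [mg2]=2001:db8::2 )

for tag in "${!peers[@]}"; do
  cat > "/etc/systemd/network/10-gre-${tag}.netdev" <<EOF
[NetDev]
Name=gre-${tag}
Kind=ip6gre

[Tunnel]
Local=2001:db8::100
Remote=${peers[$tag]}
TTL=64
EOF
  cat > "/etc/systemd/network/10-gre-${tag}.network" <<EOF
[Match]
Name=gre-${tag}

[Network]
LinkLocalAddressing=no
EOF
done
networkctl reload   # have systemd-networkd pick up the new units
```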

Verify the interfaces exist and carry traffic:

```
ip -6 tunnel show | grep gre-mg
ip -6 -s link show gre-mg1
```
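
Attribution is keyed on the kernel ifindex behind each name (see the rescan notes under Troubleshooting). To see which ifindex an interface currently has, take the first field of the one-line `ip -o link` output:

```
ip -o link show gre-mg1 | cut -d: -f1
```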

## 3. Write the nginx configuration

The plugin needs three things in `nginx.conf`:

1. A shared-memory zone for counters (`ipng_stats_zone`).
2. One device-bound `listen` directive per attributed (interface, address family) pair.
3. A scrape location serving the `ipng_stats` handler.

A minimal working configuration looks like this:

```nginx
load_module modules/ngx_http_ipng_stats_module.so;

events {
    worker_connections 4096;
}

http {
    ipng_stats_zone ipng:4m;
    ipng_stats_flush_interval 1s;
    ipng_stats_default_source direct;

    # Attributed vhost. Wildcard listens below register one binding
    # per (device, family); all collapse to a single kernel socket
    # under the IP_PKTINFO attribution model.
    server {
        include /etc/nginx/ipng-stats/listens.conf;

        server_name _;
        root /var/www/html;
    }

    # Direct (un-attributed) traffic on a separate port — the listen has no
    # device=, so requests get the `ipng_stats_default_source` tag.
    server {
        listen 198.51.100.1:8081 default_server;
        listen [2001:db8::1]:8081 default_server;

        server_name _;
        root /var/www/html;
    }

    # A separate server block exposing the scrape endpoint on a locked-down port.
    server {
        listen 127.0.0.1:9113;
        listen [::1]:9113;

        location = /.well-known/ipng/statsz {
            ipng_stats;
            allow 127.0.0.1;
            allow ::1;
            allow 2001:db8::/48;  # your scrape consumers
            deny all;
        }
    }
}
```

And `/etc/nginx/ipng-stats/listens.conf` — the hand-maintained include file — is two lines per attributed interface (one per address family):

```nginx
listen 80 device=gre-mg1 ipng_source_tag=mg1;
listen [::]:80 device=gre-mg1 ipng_source_tag=mg1;
listen 80 device=gre-mg2 ipng_source_tag=mg2;
listen [::]:80 device=gre-mg2 ipng_source_tag=mg2;
listen 80 device=gre-mg3 ipng_source_tag=mg3;
listen [::]:80 device=gre-mg3 ipng_source_tag=mg3;
listen 80 device=gre-mg4 ipng_source_tag=mg4;
listen [::]:80 device=gre-mg4 ipng_source_tag=mg4;
```
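
The file is mechanical enough to generate rather than hand-edit. A sketch, assuming the `gre-mg<N>` naming used throughout this guide:

```bash
for i in 1 2 3 4; do
  printf 'listen 80 device=gre-mg%d ipng_source_tag=mg%d;\n' "$i" "$i"
  printf 'listen [::]:80 device=gre-mg%d ipng_source_tag=mg%d;\n' "$i" "$i"
done | sudo tee /etc/nginx/ipng-stats/listens.conf
```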

Test and reload:

```
sudo nginx -t
sudo nginx -s reload
```

If `nginx -t` complains about an unknown `listen` parameter (`device=` or `ipng_source_tag=`), the module isn't loaded — check step 1.

### Why wildcard listens?

You do not need to enumerate VIPs in `listen`. A wildcard `listen 80 device=gre-mg1 ipng_source_tag=mg1;` accepts any local address served through the `gre-mg1` interface, and nginx routes per-request to the right vhost by `server_name` / `Host:` header. Adding a new VIP is a `server_name` change; adding a new interface is an append to `listens.conf`.

### Sharing a single port across address families and devices

Under the `IP_PKTINFO` attribution model, all listens at a given sockaddr collapse to a single wildcard kernel socket at runtime — the kernel stamps every accepted connection with its ingress ifindex, and the module looks that up in the table of `device=` bindings registered by the listen wrapper. Multiple device-tagged wildcard listens on port 80 are therefore not "multiple sockets"; they're one wildcard socket plus N entries in the attribution table.

A device can reuse one tag across address families or split into per-family tags — whichever reads better in the scrape output:

```nginx
listen 80 device=gre-mg1 ipng_source_tag=mg1;
listen [::]:80 device=gre-mg1 ipng_source_tag=mg1;     # same tag across families
listen 80 device=gre-mg2 ipng_source_tag=mg2-v4;
listen [::]:80 device=gre-mg2 ipng_source_tag=mg2-v6;  # per-family tags
```

A plain `listen 80;` can sit alongside device-tagged listens in the same server block; the wrapper treats the first occurrence at a given `(server, sockaddr)` pair as the one that registers the kernel socket and lets subsequent device-tagged siblings register bindings without tripping nginx's duplicate-listen check. Traffic arriving on an interface that has no binding falls back to `ipng_stats_default_source` (`direct` by default). Keeping "direct" traffic on its own port — e.g. `listen 198.51.100.1:8081;` — remains a fine pattern when you want a hard split, but it's no longer required.

### Shared includes with `reuseport` (or other socket-level options)

Socket-level `listen` options — `reuseport`, `bind`, `backlog=`, `rcvbuf=`, `sndbuf=`, `setfib=`, `fastopen=`, `accept_filter=`, `deferred`, `ipv6only=`, `so_keepalive=` — belong to the one kernel socket that backs a given sockaddr, not to a particular `server { ... }` block. Stock nginx enforces this by accepting them on at most the *first* listen per sockaddr and emitting `duplicate listen options for <addr>` on any subsequent repeat. That rule collides with the common deployment pattern of a single `listens.conf` included from every vhost, because each vhost's `include` re-submits the same options.

The wrapper resolves this transparently. When a sockaddr recurs under a different `server` block than the one that first registered it, the wrapper strips socket-level options from the incoming `cf->args` before delegating to nginx's core listen handler. The first `server` block owns the options on the kernel socket (including `reuseport`, which triggers per-worker socket cloning); later blocks merge cleanly via `ngx_http_add_server` and inherit the same socket. The wrapper logs one `[notice] ipng_stats: stripped socket options from duplicate listen on <addr>` per stripped listen — informational, not an error. So this include works unchanged across as many vhosts as you like:

```nginx
listen 443 ssl reuseport device=gre-mg1 ipng_source_tag=mg1;
listen [::]:443 ssl reuseport device=gre-mg1 ipng_source_tag=mg1;
```

`reuseport` noticeably helps worker load-balancing on busy hosts: without it, a single shared listening socket forces workers to compete for accepts, and traffic routinely concentrates on one or two workers. HTTP/2 and long-lived keepalive connections can still skew CPU toward whichever worker holds a few heavy clients — `reuseport` does not reshuffle existing connections — but new-connection distribution across workers becomes kernel-hashed, not first-ready-wins.
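
To confirm that per-worker socket cloning took effect, count the listening sockets; with `reuseport` and N workers you should see N entries for the sockaddr (this is standard `ss` behavior, not module output):

```
ss -ltn 'sport = :443'
```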

## 4. Verify with curl

Generate some traffic (or wait for real traffic), then scrape the endpoint locally:

```
curl -s http://127.0.0.1:9113/.well-known/ipng/statsz
```

Default output is Prometheus text format:

```
# HELP nginx_ipng_requests_total Total HTTP requests.
# TYPE nginx_ipng_requests_total counter
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 12345
nginx_ipng_requests_total{source_tag="mg1",vip="192.0.2.10",code="4xx"} 17
nginx_ipng_requests_total{source_tag="mg2",vip="192.0.2.10",code="2xx"} 9876
nginx_ipng_requests_total{source_tag="direct",vip="192.0.2.10",code="2xx"} 42
# HELP nginx_ipng_bytes_in_total Request bytes received.
# TYPE nginx_ipng_bytes_in_total counter
nginx_ipng_bytes_in_total{source_tag="mg1",vip="192.0.2.10",code="2xx"} 9876543
# ... and so on

# Histogram series (request_duration, upstream_response, bytes_in, bytes_out)
# do NOT carry a `code` label — they aggregate across classes per (source, vip).
nginx_ipng_request_duration_seconds_bucket{source_tag="mg1",vip="192.0.2.10",le="0.050"} 11200

# In-flight gauges per (source, vip). These are point-in-time request counts,
# not rates: `active` = requests observed at POST_READ that haven't finalized
# yet; `reading` = in pre-response phases (rewrite/access/content); `writing`
# = past header send. reading + writing = active at any instant.
nginx_ipng_active{source_tag="mg1",vip="192.0.2.10"} 3
nginx_ipng_reading{source_tag="mg1",vip="192.0.2.10"} 1
nginx_ipng_writing{source_tag="mg1",vip="192.0.2.10"} 2
```

For JSON output instead, set the `Accept` header:

```
curl -s -H 'Accept: application/json' http://127.0.0.1:9113/.well-known/ipng/statsz | jq .
```

To filter server-side to a single source tag:

```
curl -s 'http://127.0.0.1:9113/.well-known/ipng/statsz?source_tag=mg1'
curl -s 'http://127.0.0.1:9113/.well-known/ipng/statsz?source_tag=mg1&vip=192.0.2.10'
```
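
A quick end-to-end smoke test is to snapshot a counter, issue one request, and confirm the counter moved. A sketch, assuming the example VIP `192.0.2.10` from the output above answers on `/`:

```bash
count_2xx() {
  curl -s http://127.0.0.1:9113/.well-known/ipng/statsz \
    | awk '/^nginx_ipng_requests_total.*code="2xx"/ { sum += $2 } END { print sum+0 }'
}
before=$(count_2xx)
curl -s -o /dev/null http://192.0.2.10/   # one request against the VIP
after=$(count_2xx)
echo "2xx delta: $((after - before))"     # expect at least 1
```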

If you see `source_tag="direct"` entries with non-zero counts and you expected all traffic to come in via attributed interfaces, something is routing around them — typically an interface that isn't in `listens.conf`, or an interface that's down.

## 5. Scrape from Prometheus

The same endpoint serves Prometheus text by default. Add a scrape job:

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: nginx-ipng
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'nginx-backend-1.example.com:9113'
          - 'nginx-backend-2.example.com:9113'
    metrics_path: /.well-known/ipng/statsz
```

You'll want the `allow` rules in each backend's scrape server block to admit your Prometheus host, or front the plugin with a TLS-terminating reverse proxy. The module does not ship its own auth; the nginx `allow`/`deny` ACL is your access control.

Typical PromQL queries:

```
# Requests per second per source, per VIP:
sum by (source_tag, vip) (rate(nginx_ipng_requests_total[1m]))

# 5xx error rate per VIP, aggregated across all sources:
sum by (vip) (rate(nginx_ipng_requests_total{code="5xx"}[5m]))
/
sum by (vip) (rate(nginx_ipng_requests_total[5m]))

# p95 request duration per (source_tag, vip):
histogram_quantile(0.95,
  sum by (source_tag, vip, le) (rate(nginx_ipng_request_duration_seconds_bucket[5m])))

# In-flight concurrency per (source_tag, vip). Gauges are exported as-is;
# use max_over_time for load-shedding alerts or avg_over_time for capacity
# planning:
max_over_time(nginx_ipng_active[5m])
```
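
These queries drop straight into alerting rules as well. An illustrative rule file — the path and the 5% threshold are placeholders to tune, not recommendations:

```yaml
# /etc/prometheus/rules/nginx-ipng.yml (hypothetical path)
groups:
  - name: nginx-ipng
    rules:
      - alert: NginxIpng5xxRateHigh
        expr: |
          sum by (vip) (rate(nginx_ipng_requests_total{code="5xx"}[5m]))
            / sum by (vip) (rate(nginx_ipng_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: '5xx ratio above 5% on VIP {{ $labels.vip }}'
```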

## 6. Set up a global logtail access log

Operators who want a single unified access log covering all traffic — regardless of which `server` block handled the request — normally have to repeat `access_log` in every `server {}` block or rely on a catch-all virtual host. The `ipng_stats_logtail` directive removes that requirement: one line at the `http` level registers a global log-phase writer that fires unconditionally for every request (FR-8.1).

The logtail is also the recommended escape hatch when you need richer cardinality than the stats zone exposes. The Prometheus counters deliberately collapse HTTP status codes into six class lanes (`1xx`..`5xx`/`unknown`) to keep scrape size bounded. Operators who need per-three-digit-code, per-path, per-user-agent, or any other high-cardinality breakdown should ship the logtail stream to an off-path analytics receiver and compute those views there — that work happens in a different process and never touches the nginx hot path.

The logtail sends each buffer flush as a single UDP datagram to a `host:port`. Zero disk I/O, no backpressure, no blocking if the receiver is down. This makes it ideal for fire-and-forget analytics pipelines where delivery guarantees are unnecessary and disk writes would add unwanted I/O pressure. For file-based access logging, use nginx's built-in `access_log` directive.

### Define the log format

Add a `log_format` declaration inside the `http { ... }` block, **before** the `ipng_stats_logtail` directive that references it:

```nginx
log_format ipng_stats_logtail '$host\t$remote_addr\t$request_method\t$request_uri\t'
                              '$status\t$body_bytes_sent\t'
                              '$ipng_source_tag\t$server_addr\t$scheme';
```

Any nginx variable is usable here, including `$ipng_source_tag` (the device attribution tag, FR-6.1), `$server_addr` (the VIP that received the request), and `$scheme` (`http` or `https` — useful since `$server_addr` alone doesn't distinguish ports).

### Configuration

```nginx
http {
    ipng_stats_zone ipng:4m;

    log_format ipng_stats_logtail '$host\t$remote_addr\t$request_method\t$request_uri\t'
                                  '$status\t$body_bytes_sent\t'
                                  '$ipng_source_tag\t$server_addr\t$scheme';

    ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=16k flush=1s;

    server { ... }
}
```

- **`ipng_stats_logtail`** (first argument) — the `log_format` name.
- **`udp://127.0.0.1:9514`** — destination as a `udp://host:port` URI. `host` must be a literal IPv4 address (no hostnames, no IPv6 in v0.1).
- **`buffer=16k`** — per-worker write buffer. Lines are held in memory until the buffer fills, the flush timer fires, or the worker exits. Default is `64k`; minimum is `1k` (FR-8.3).
- **`flush=1s`** — maximum age of buffered data before it is sent. Default is `1s`; minimum is `100ms` (FR-8.3).

Each buffer flush becomes a single `sendto()` on a per-worker `SOCK_DGRAM` socket. When the flush timer fires (or the buffer fills), the entire buffered payload is sent as one datagram — no file open, no `write()`, no `fsync()`. If no receiver is listening, the kernel drops the datagram silently and the worker carries on. This is by design: the logtail exists for non-critical analytics pipes where lost datagrams are acceptable and disk I/O is not.

**Constraints (v0.1):**

- `host` must be a literal IPv4 address. Hostnames and IPv6 are not supported yet.
- Large `buffer=` values produce large datagrams. On the loopback interface the practical ceiling is ~64 KB, well above typical configured buffer sizes. On routed paths, path MTU applies.
- There is no acknowledgment, retry, or sequence number. If the receiver is down, the data is gone.

### Filtering with `if=`

High-frequency requests like health checks can be suppressed from the logtail stream using the `if=$variable` parameter. Use a `map` block to define which requests should be logged:

```nginx
map $request_uri $logtail_enabled {
    ~^/\.well-known/ipng/healthz  0;
    default                       1;
}

ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=16k flush=1s if=$logtail_enabled;
```

Filtered requests are still counted by the stats module — only the logtail output is suppressed. The condition is checked before the log format is rendered, so filtered requests have zero logtail overhead. Multiple conditions can be combined using nested `map` blocks.
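
One hypothetical combination — suppress both the health-check path above and a probe user agent; the intermediate variable names are illustrative:

```nginx
map $request_uri $logtail_uri_ok {
    ~^/\.well-known/ipng/healthz  0;
    default                       1;
}

# Feed the first map's result plus a second condition into the final switch.
map "$logtail_uri_ok:$http_user_agent" $logtail_enabled {
    "~^0:"            0;  # suppressed by URI
    "~^1:kube-probe"  0;  # suppressed by user agent
    default           1;
}
```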

See [`config-guide.md`](config-guide.md#conditional-logging-with-if) for the full semantics.

**Starting a receiver** is trivial:

```bash
# Quick one-shot inspection:
nc -u -l 127.0.0.1 9514
```
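
For a longer-running capture than `nc` gives you, any UDP listener works; one option among many, using socat (assuming it is installed):

```bash
# Keep receiving datagrams and append them to a file:
socat -u UDP-RECV:9514,bind=127.0.0.1 - >> /tmp/logtail.tsv
```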

For a production-ready logtail consumer, see [`nginx-logtail`](https://git.ipng.ch/ipng/nginx-logtail), which receives the UDP datagram stream and processes it into structured log output.

A typical received log line (with the format above, tab-separated) looks like:

```
example.com 203.0.113.42 GET /index.html 200 4321 mg1 192.0.2.10 https
```

The `mg1` field comes from `$ipng_source_tag` and `https` from `$scheme` — free per-device attribution and protocol visibility in every log line.

### Why this complements per-server `access_log`

A conventional nginx access log requires the operator to repeat `access_log /path/to/file logtail;` in every `server {}` block that should be captured. This is error-prone: adding a new vhost and forgetting the directive means that vhost's traffic is silently absent from the log. `ipng_stats_logtail` is installed at the module's log-phase hook, which nginx calls for every request with no per-server configuration required.

See [`config-guide.md`](config-guide.md#ipng_stats_logtail-format_name-udphostport-buffersize-flushduration) for the full directive reference and FR-8 for the requirements behind this feature.

## 7. Integrate with scrape consumers

The scrape endpoint (`ipng_stats;`) serves both Prometheus text and JSON from a single location. Any HTTP client that can issue a GET request can consume it. Two integration patterns are common:

### Prometheus

See section 5 above. Prometheus scrapes the endpoint at a configured interval and stores the time series. This is the simplest integration and covers most monitoring and alerting use cases.

### Custom consumers

The `?source_tag=<tag>` query parameter lets a consumer filter the scrape response to only the traffic attributed to a specific source. This is useful when multiple consumers share the same nginx backends — each consumer scrapes with its own tag and never sees the others' traffic.

The JSON output (`Accept: application/json`) includes a top-level `schema` field for versioning, making it straightforward to parse from any language.
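
A minimal consumer sketch with `curl` and `jq`. It checks only the documented top-level `schema` field before handing the payload on; the rest of the field layout depends on the schema version, so nothing else is assumed here:

```bash
resp=$(curl -s -H 'Accept: application/json' \
  'http://127.0.0.1:9113/.well-known/ipng/statsz?source_tag=mg1')
echo "$resp" | jq -e '.schema' >/dev/null \
  || { echo 'unexpected payload: no schema field' >&2; exit 1; }
echo "$resp" | jq .schema   # dispatch on this before parsing further
```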

Once wired, a consumer can derive from the scrape data:

- Live QPS per backend (from the EWMA gauges).
- Status-class mix per backend (the six-lane `1xx`..`5xx`/`unknown` counter families). Full three-digit codes are not exported by the scrape endpoint; route the logtail stream off-host and aggregate there if you need per-code breakdowns.
- p50/p95 latency per backend (from the duration histogram, aggregated across classes).
- Traffic volume per backend (from the bytes counters and the new bytes histograms).

For an example of this pattern in a GRE tunnel fleet, see [`vpp-maglev`](https://git.ipng.ch/ipng/vpp-maglev), whose frontend scrapes each nginx backend filtered by source tag to show per-backend traffic alongside health state.

## Troubleshooting

**`nginx -t` reports "unknown listen parameter: device=" or "unknown listen parameter: ipng_source_tag=".** The module isn't loaded. Check `/etc/nginx/modules-enabled/` for the `50-mod-http-ipng-stats.conf` symlink and re-run `nginx -t`.

**All traffic is attributed to `direct` even though device-bound interfaces exist.** The interface names don't match the `device=` values in `listens.conf`, or the interfaces aren't up. Run `ip -br link` and confirm the interface names match.

**Counters reset after every reload.** They should survive `nginx -s reload`. If they don't, check that the `ipng_stats_zone` name in `nginx.conf` is stable across reloads — renaming the zone forces a new shared-memory segment.

**`nginx_ipng_zone_full_events_total` is non-zero.** The shared-memory zone is too small for your VIP count. Increase the size in `ipng_stats_zone ipng:<size>` (the default 4 MB is enough for hundreds of VIPs — the code dimension is bucketed to six classes, so one 4 MB zone holds a very large deployment).

**`nginx_ipng_ifindex_misses_total` is climbing.** A connection arrived on an interface whose ifindex isn't in the binding table. Two common causes: (a) a configured interface was torn down and recreated (e.g. a GRE tunnel reprovision) and now has a fresh ifindex — the per-worker rescan timer (`ipng_stats_rescan_interval`, default `60s`) will pick it up on the next tick; (b) traffic legitimately arrives on an interface that no `device=` binding claims — either add the binding or accept that it lands under the default source. If the counter keeps rising between rescans, shorten `ipng_stats_rescan_interval` or trigger `nginx -s reload` to re-resolve immediately.

**`curl http://127.0.0.1:9113/.well-known/ipng/statsz` returns "403 Forbidden".** The `allow`/`deny` ACL is blocking your source address. Either add yourself to the allow list or scrape from a host already on it.

## Where to go next

- [`config-guide.md`](config-guide.md) — every directive and `listen` parameter with contexts, allowed values, and defaults.
- [`design.md`](design.md) — full design document, including the attribution model, hot-path cost analysis, and failure modes.