diff --git a/content/articles/2026-05-15-vpp-maglev-3.md b/content/articles/2026-05-15-vpp-maglev-3.md new file mode 100644 index 0000000..8c9b043 --- /dev/null +++ b/content/articles/2026-05-15-vpp-maglev-3.md @@ -0,0 +1,555 @@ +--- +date: "2026-05-15T18:22:11Z" +title: VPP with Maglev Loadbalancing - Part 3 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +In the [[first article]({{< ref "2026-04-30-vpp-maglev" >}})] of this series, I looked at the Maglev +algorithm and how it is implemented in VPP. Then, I wrote about a health checking controller called +[[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)], and its architecture, server, client and +frontend in a [[second piece]({{< ref 2026-05-08-vpp-maglev-2 >}})]. The traffic flows in a somewhat +more complex way now from users to IPng's webservers, so this article dives into the observability +that makes this system manageable. + +## Introduction + +One might argue the Maglev delivery system is elegant. Traffic comes in over anycasted VIPs, VPP +hashes each new TCP flow onto an application backend using the Maglev algorithm. The backend +receives the packet via a GRE6 tunnel and responds directly to the client. Elegant, stateless, fast. + +{{< image width="10em" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}} + +I would argue, this is elegant right up until something explodes, and then all of a sudden the +system is just an opaque blackbox: VPP operates at the flow level and has no knowledge of individual +HTTP requests. GRE delivery means nginx sees the client IP but not which maglev sent it. Direct +Server Return means the response never passes through the original VPP Maglev instance again. +Nothing in this chain sees the full picture of a single request's journey, which will make me lose +sleep. And I love sleeping. + +To close that gap, I need to build an observability layer on top of this system: + +1. **vpp-maglev** itself exposes backend health state and VPP dataplane counters over Prometheus. +2. **nginx-ipng-stats-plugin** is an nginx module that attributes each request to the Maglev + frontend that delivered it, and exports per-VIP, per-frontend counters over Prometheus. +3. **nginx-logtail** is a Go pipeline that ingests per-request log lines from all nginx instances + and maintains globally ranked top-K tables across multiple time windows for high-cardinality + queries, exposed via gRPC and rollup stats, once again, over Prometheus. + +Before explaining more, I need to get something off my chest. Some readers asked me based on my last +article, why would I name the header `X-IPng-Frontend` even though it's clearly a maglev backend? It +turns out, the words "frontend" and "backend" are overloaded, pretty much in any system that has +more than two components. In a chain (VPP Maglev <-> nginx <-> docker container), nginx is both a +backend (of the maglev) and a frontend (of the docker container), so I'll use the following terms to +try to disambiguate: + +- **maglev frontend**: a VPP machine that announces BGP anycast VIPs and forwards traffic to nginx + machines via GRE6 tunnels. 
+- **nginx frontend**: an nginx machine that receives GRE-wrapped packets, unwraps them, and proxies
+  requests to application containers. From Maglev's perspective it is an application server (AS);
+  from the web service perspective it is the front door.
+- **nginx backend**: a Docker container running an application like Hugo, Nextcloud, or Immich. This
+  is what the user's request ultimately reaches.
+
+IPng currently announces two IPv4/IPv6 anycasted VIPs (`vip0.l.ipng.ch` and `vip1.l.ipng.ch`) from
+four maglev frontends (`chbtl2`, `frggh1`, `chlzn1`, `nlams1`) and eight nginx frontends spread
+across Switzerland, France, and the Netherlands. Adding or removing Maglev and nginx instances is
+non-intrusive and can be done in mere minutes with Ansible and Kees.
+
+### Maglev: GRE Delivery
+
+VPP delivers each packet to an nginx machine by wrapping it in a GRE6 tunnel. GRE6 uses IPv6 as the
+outer header regardless of whether the inner packet is IPv4 or IPv6. The per-packet overhead is
+fixed at 44 bytes: the outer IPv6 header takes 40 bytes, and the GRE header, assuming no
+key/checksum, weighs in at an additional 4 bytes. A standard 1500-byte client packet wrapped in GRE6
+comes out at 1544 bytes regardless of address family:
+
+| Inner packet | Breakdown | Inner L3 size | GRE6 overhead | Wrapped total |
+|---|---|---|---|---|
+| IPv4 TCP | 20B IPv4 + 20B TCP + 1460B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
+| IPv6 TCP | 40B IPv6 + 20B TCP + 1440B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
+
+To accommodate those 1544-byte wrapped packets without fragmentation, the internet-facing interface
+`enp1s0f0` on each nginx machine runs at an MTU of 2026 bytes, and the GRE tunnel interfaces
+themselves get an MTU of 1544 bytes (see the netplan config below). The 44-byte encapsulation
+overhead is an internal implementation detail; from the internet's perspective, traffic wants to
+flow at the standard 1500-byte MTU end to end.
+
+That last point is why [[MSS clamping](https://en.wikipedia.org/wiki/Maximum_segment_size)] is
+necessary. When nginx sends its SYN-ACK in the three-way handshake, it would normally derive its
+advertised MSS from the MTU of the interface the reply leaves on. An interface MTU of 2026 would
+yield an MSS of 1986 bytes (IPv4) or 1966 bytes (IPv6), telling the client it can send segments of
+almost 2000 bytes. Those segments would travel across the internet on a standard 1500-byte path and
+cause fragmentation or black-holing before they even reached VPP. No bueno!
+
+MSS clamping in the SYN-ACK overrides that interface-derived value and advertises the standard
+internet MSS instead:
+
+- IPv4 clients: 1500 - 20 (IPv4) - 20 (TCP) = **1460 bytes**
+- IPv6 clients: 1500 - 40 (IPv6) - 20 (TCP) = **1440 bytes**
+
+The client then sends standard-sized segments. Those arrive at VPP as 1500-byte packets, get
+GRE6-wrapped to 1544 bytes, and arrive at the nginx frontend well within the 2026-byte MTU of
+`enp1s0f0`. On the return path, nginx sends responses directly to the client via DSR at the
+standard 1500-byte MTU.
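+For the spreadsheet-averse, the same arithmetic as a few lines of Go. This is only a sanity check
+of the numbers above, written for this article and not taken from any of the repositories:
+
+```go
+package main
+
+import "fmt"
+
+const (
+    gre6Overhead = 40 + 4 // outer IPv6 header plus GRE header without key/checksum
+    pathMTU      = 1500   // what the internet at large carries end to end
+)
+
+func main() {
+    // A full-sized client packet, once GRE6-wrapped, needs this much room on the underlay.
+    fmt.Printf("wrapped size: %d bytes\n", pathMTU+gre6Overhead) // 1544
+
+    // The MSS values that clamping should advertise in the SYN-ACK.
+    fmt.Printf("clamped MSS IPv4: %d bytes\n", pathMTU-20-20) // 1460
+    fmt.Printf("clamped MSS IPv6: %d bytes\n", pathMTU-40-20) // 1440
+
+    // What nginx would advertise without clamping, derived from a 2026-byte interface MTU.
+    const ifMTU = 2026
+    fmt.Printf("unclamped MSS IPv4: %d bytes\n", ifMTU-20-20) // 1986
+    fmt.Printf("unclamped MSS IPv6: %d bytes\n", ifMTU-40-20) // 1966
+}
+```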
+On each nginx, I first allow GRE6 from the Maglev frontends, and apply the MSS clamping on the
+(larger MTU) internet-facing interface `enp1s0f0`. Then, using netplan, I'll create an IP6GRE
+tunnel to each Maglev frontend, using descriptive interface names which reveal who owns the remote
+side of the tunnel:
+
+```
+pim@nginx0-chlzn0:~$ sudo ip6tables -A INPUT -p gre -s 2001:678:d78::/96 -j ACCEPT
+pim@nginx0-chlzn0:~$ sudo ip6tables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440
+pim@nginx0-chlzn0:~$ sudo iptables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
+pim@nginx0-chlzn0:~$ cat /etc/netplan/01-netcfg.yaml
+network:
+  version: 2
+  renderer: networkd
+  ethernets:
+  ...
+  tunnels:
+    chbtl2:
+      mode: ip6gre
+      mtu: 1544
+      local: 2001:678:d78:f::2:0
+      remote: 2001:678:d78::e
+      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
+    chlzn1:
+      mode: ip6gre
+      mtu: 1544
+      local: 2001:678:d78:f::2:0
+      remote: 2001:678:d78::10
+      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
+  ...
+```
+
+A tcpdump on the nginx frontend confirms the handshake looks correct. Both sides settle on the
+standard internet MSS values despite the larger internal MTU:
+
+```
+pim@nginx0-chlzn0:~$ sudo tcpdump -i any -n '(tcp[tcpflags] & tcp-syn) != 0 or (ip6 and ip6[6]=6 and (ip6[13+40] & 2) != 0)'
+05:13:46.547826 chbtl2 In IP 162.19.252.246.39246 > 194.1.163.31.443: Flags [S], seq 2576867891,
+    win 64240, options [mss 1460,sackOK,TS val 767241235 ecr 0,nop,wscale 7], length 0
+05:13:46.547860 enp1s0f0 Out IP 194.1.163.31.443 > 162.19.252.246.39246: Flags [S.], seq 3931956759,
+    ack 2576867892, win 65142, options [mss 1460,sackOK,TS val 681127624 ecr 767241235,nop,wscale 7], length 0
+
+05:13:46.584236 chlzn1 In IP6 2a03:2880:f812:5e::.28858 > 2a0b:dd80::1:0:1.443: Flags [S], seq 36254033,
+    win 65535, options [mss 1380,sackOK,TS val 3022307959 ecr 0,nop,wscale 8], length 0
+05:13:46.584300 enp1s0f0 Out IP6 2a0b:dd80::1:0:1.443 > 2a03:2880:f812:5e::.28858: Flags [S.], seq 3586034356,
+    ack 36254034, win 64482, options [mss 1440,sackOK,TS val 1977557327 ecr 3022307959,nop,wscale 7], length 0
+```
+
+The thing to look for here is that the IPv4 client at OVH was told, in the SYN-ACK going out on
+`enp1s0f0`, that we are happy to take an MSS of 1460 for this IPv4 TCP connection, despite our
+having a larger MTU on that interface. Similarly, the IPv6 client at Meta was told we'll accept an
+MSS of 1440. That's MSS clamping at work!
+
+With this, I also know that traffic arriving on `chbtl2` was forwarded by the `chbtl2` maglev
+frontend. That mapping is the key that makes per-source attribution possible, which is a property
+that I can exploit in an nginx plugin. Clever!
+
+### NGINX: Stats Plugin
+
+I wrote a tiny nginx module on
+[[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)] that counts requests
+per VIP, split by which interface (in my case, which GRE tunnel) delivered them. It is packaged as a
+standard Debian `libnginx-mod-http-ipng-stats` package on [[deb.ipng.ch](https://deb.ipng.ch/)] and
+loads into stock upstream nginx without recompiling nginx itself. The design document and user guide
+are in the [[docs/](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin/src/branch/main/docs)]
+directory of the repo.
+
+#### A curious case of `SO_BINDTODEVICE` versus `IP_PKTINFO`
+
+Attributing a connection to its ingress interface sounds simple: bind each listening socket to a
+specific interface with `SO_BINDTODEVICE`.
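+In a Go-flavoured sketch (the module itself is C, and this is my illustration of the idea rather
+than its code), that per-interface binding would look something like:
+
+```go
+package main
+
+import (
+    "context"
+    "log"
+    "net"
+    "syscall"
+
+    "golang.org/x/sys/unix"
+)
+
+// listenOnDevice opens a TCP listener whose socket is pinned to one interface
+// with SO_BINDTODEVICE: the approach that turns out to break DSR.
+func listenOnDevice(addr, device string) (net.Listener, error) {
+    lc := net.ListenConfig{
+        Control: func(network, address string, c syscall.RawConn) error {
+            var serr error
+            if err := c.Control(func(fd uintptr) {
+                serr = unix.SetsockoptString(int(fd), unix.SOL_SOCKET, unix.SO_BINDTODEVICE, device)
+            }); err != nil {
+                return err
+            }
+            return serr
+        },
+    }
+    return lc.Listen(context.Background(), "tcp", addr)
+}
+
+func main() {
+    // One listener per maglev frontend tunnel; whichever listener accepts a
+    // connection tells us which frontend delivered it.
+    ln, err := listenOnDevice("194.1.163.31:443", "chbtl2")
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer ln.Close()
+}
+```
+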
Traffic arriving on `chbtl2` goes to that socket; +traffic on `nlams1` goes to another socket; attribution falls out of which socket accepted the +connection. And this approach signaled my first failure :-) + +The problem is that `SO_BINDTODEVICE` affects both ingress _and egress_. A socket bound to +`chbtl2` device will try to route its return traffic back through that GRE tunnel, back to the +Maglev machine. That completely breaks direct server return, which requires responses to leave via +the default gateway (the internet-facing NIC). `SO_BINDTODEVICE` on a GRE interface and DSR are +mutually exclusive. Yikes. + +{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}} + +But Linux has another trick up its kernelistic sleeve: `IP_PKTINFO` (for +[[IPv4](https://man7.org/linux/man-pages/man7/ip.7.html)]) and `IPV6_RECVPKTINFO` (for +[[IPv6](https://man7.org/linux/man-pages/man7/ipv6.7.html)]). These socket options tell the kernel +to attach a control message (cmsg) to each accepted connection containing the interface index on +which the packet arrived. The listening socket itself remains a wildcard bound to no specific +interface, so outgoing packets follow the normal routing table and leave via the default gateway. +Attribution comes from reading the cmsg at connection time, not from socket binding. Whoot! + +In pseudocode, reading the ingress interface from the ancillary data looks like this: + +```c +// Enable IP_PKTINFO on the socket at init time. +setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one)); +setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one)); + +// At accept time, read the cmsg to find the ifindex. +struct msghdr msg = { ... }; +recvmsg(fd, &msg, 0); +for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) { + struct in_pktinfo *pki = (struct in_pktinfo *)CMSG_DATA(cm); + ifindex = pki->ipi_ifindex; // <- the ingress interface + } + if (cm->cmsg_level == IPPROTO_IPV6 && cm->cmsg_type == IPV6_PKTINFO) { + struct in6_pktinfo *pki = (struct in6_pktinfo *)CMSG_DATA(cm); + ifindex = pki->ipi6_ifindex; // <- the ingress interface + } +} +``` + +The module enables both socket options on every nginx listening socket and reads the ifindex from +the cmsg on each accepted connection. It then looks the ifindex up in a table built at +configuration time from the `device=` parameters on the `listen` directives. The match +produces a short attribution tag. Connections that arrive on an interface with no registered +binding fall back to the configurable default tag (called `direct`), which handles unattributed +traffic like direct HTTPS connections that bypass Maglev entirely. + +#### NGINX Plugin + +Three things are needed in `nginx.conf`: a shared memory zone for counters, device-bound `listen` +directives that map each GRE interface to a source tag, and a scrape location: + +```nginx +http { + ipng_stats_zone ipng:4m; + + server { + # One device-tagged listen per GRE interface per address family. + # Each 'listen' tells the module which ifname maps to which tag. 
+ listen 80 device=chbtl2 ipng_source_tag=chbtl2; + listen [::]:80 device=chbtl2 ipng_source_tag=chbtl2; + listen 443 device=chbtl2 ipng_source_tag=chbtl2 ssl; + listen [::]:443 device=chbtl2 ipng_source_tag=chbtl2 ssl; + listen 80 device=nlams1 ipng_source_tag=nlams1; + listen [::]:80 device=nlams1 ipng_source_tag=nlams1; + listen 443 device=nlams1 ipng_source_tag=nlams1 ssl; + listen [::]:443 device=nlams1 ipng_source_tag=nlams1 ssl; + # ... repeat for frggh1, chlzn1 ... + } + + # Scrape endpoint on a management port. + server { + listen 127.0.0.1:9113; + location = /.well-known/ipng/statsz { + ipng_stats; + allow 127.0.0.1; + deny all; + } + } +} +``` + +The cool trick is, that multiple `device=` listens on the same port are not multiple kernel sockets; +under this `IP_PKTINFO` model they collapse to a single wildcard socket, and the module distinguishes +traffic by reading the ifindex from the cmsg. Adding a new VIP is a `server_name` change in nginx; +adding a new maglev frontend is a two-line append to the listens file. Neither requires a restart. + +I kept the hot path intentionally minimal. On each request's log phase, the nginx worker increments +a counter in its own private table. There are no locks; no atomics; nothing fancy. Just an integer +increment into memory that only this worker ever writes. A periodic timer (default one second) then +flushes the worker's private deltas into the shared memory zone using atomic adds. The scrape +handler called `ipng_stats` reads only from shared memory. + +The module also registers an nginx variable `$ipng_source_tag` that resolves to the attribution +tag for the current connection. That variable is available in `log_format`, `map`, `add_header`, +and any other directive that accepts nginx variables, which is how the logtail pipeline gets the +attribution. + +Scraping the endpoint confirms attribution is working. With `Accept: text/plain`, the output is +Prometheus text format; with `Accept: application/json`, it is JSON. Both support `source_tag=` +and `vip=` query parameters to filter the output: + +```sh +pim@nginx0-chlzn0:~$ curl -Ss http://localhost:9113/.well-known/ipng/statsz?source_tag=chbtl2 +nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="4xx"} 100062 +nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="2xx"} 14621209 +nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="4xx"} 6340 +nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="2xx"} 10339863 +nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="194.1.163.31"} 1599408141.000 +nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 2405616085.000 +nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="194.1.163.31"} 418826340291.000 +nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 47520361606.000 +# ... +``` + +The `code` label is bucketed into six classes (`1xx`..`5xx` and `unknown`), which keeps +per-VIP cardinality bounded regardless of how many distinct HTTP status codes appear in the wild +(and trust me, they *all* occur in the wild). The module also exports request duration histograms +and upstream response time histograms, but the request/byte counters above are the day-to-day +workhorses. + +### NGINX: Log Hook + +Since I'm looking at all requests in the nginx log phase anyway, I thought perhaps I could go one +step further. Prometheus counters answer "how much traffic?" but not "from whom?" or "to which +URI?". 
Adding per-client-IP or per-URI dimensions to Prometheus would be a catastrophic idea during a
+DDoS: a modest attack with one million source IPs would create one million Prometheus time series
+and cause the monitoring system to be the first casualty of the incident. The C module is
+deliberately narrow.
+
+Instead, every request emits a structured log line that carries all the high-cardinality
+dimensions. The format used by the stats plugin's logtail integration is:
+
+```nginx
+log_format ipng_stats_logtail
+    'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status'
+    '\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag'
+    '\t$server_addr\t$scheme';
+```
+
+The `v1\t` prefix is a version tag. When the format needs to evolve, a new version (say `v2`)
+can be added while old emitters are still running; a reader can route each packet to the
+appropriate parser by looking at the version. In case you're curious, variables like `$is_tor`
+come from a map I maintain of Tor exit nodes, and `$asn` comes from Maxmind GeoIP. Check it out
+[[here](https://www.maxmind.com/en/geoip-databases)].
+
+{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
+
+Adding an `access_log` directive in every `server` or `location` block would be error-prone and
+would miss any newly added vhost. It would also potentially cause a lot of disk activity, writing
+both the `logtail` format and the regular `access_log`. I decide that my stats plugin will provide
+an `ipng_stats_logtail` directive at the `http` level that fires globally for every request,
+regardless of which server or location handled it. Because why not?
+
+To exclude noisy requests like health probes from the logtail stream, I add an `if=` parameter
+(mirroring the one on `access_log`) which evaluates an nginx variable at log phase and suppresses
+emission when the value is empty or `0`. A `map` block is the idiomatic way to build that variable:
+
+```nginx
+map $request_uri $logtail_skip_uri {
+    ~^/\.well-known/ipng 1;
+    default 0;
+}
+
+map $host $logtail_skip_host {
+    maglev.ipng.ch 1;
+    default 0;
+}
+
+map "$logtail_skip_uri:$logtail_skip_host" $logtail_enabled {
+    "0:0" 1;
+    default 0;
+}
+
+ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s if=$logtail_enabled;
+```
+
+It took me a while to get used to constructions like this. The first map matches `$request_uri`
+against a regular expression and yields a string (0 or 1); the second does the same for `$host`
+(exact hostnames are a hash lookup, regexes are checked in order). Then, some funky boolean math
+allows me to concatenate these two strings in a new map, which can yield "0:0" (neither 'skip'
+matched), "1:0" (the URI skip matched but the host skip did not), "0:1" (the URI did not match but
+the host did), or "1:1" (both matched), once again mapping those to a string (0 or 1), which the
+`ipng_stats_logtail` directive can use in its `if=` argument.
+
+The log lines themselves are buffered in a per-worker memory buffer (64 KB by default) and
+transmitted as a single UDP datagram on flush. If no receiver is listening on `127.0.0.1:9514`, the
+kernel will silently discard the datagram. No blocking, no error, no disk I/O. This fire-and-forget
+design is just great: analytics should never slow down a request. A lost log line is acceptable; a
+slow request is not.
+
+### NGINX: Logtail
+
+But who or what consumes these UDP packets?
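+During development, before any real consumer existed, a few lines of Go were enough to eyeball the
+stream. A throwaway debugging listener (my own, not part of any of the repositories mentioned here)
+is all it takes:
+
+```go
+package main
+
+import (
+    "fmt"
+    "log"
+    "net"
+    "strings"
+)
+
+func main() {
+    // Listen where the ipng_stats_logtail directive above sends its datagrams.
+    conn, err := net.ListenPacket("udp", "127.0.0.1:9514")
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer conn.Close()
+
+    buf := make([]byte, 65535)
+    for {
+        n, _, err := conn.ReadFrom(buf)
+        if err != nil {
+            log.Fatal(err)
+        }
+        // One datagram carries a whole flush buffer: many tab-separated 'v1' log lines.
+        for _, line := range strings.Split(strings.TrimRight(string(buf[:n]), "\n"), "\n") {
+            if fields := strings.Split(line, "\t"); len(fields) > 0 && fields[0] == "v1" {
+                fmt.Println(line)
+            }
+        }
+    }
+}
+```
+
+In production, though, something considerably more capable picks them up.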
Enter +[[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)], a four-binary Go pipeline that ingests +those log lines and answers "which client prefix is being served 429s right now?" or "which ASN is +sending me most requests to `/ct/api` in the last 6hrs?" I'll just come right out and admit it: this +little program is 100% written and maintained by Claude Code, and I had no hesitation deploying it; +I reviewed every bit of the code before it went into production. The design document is in +[[docs/design.md](https://git.ipng.ch/ipng/nginx-logtail/src/branch/main/docs/design.md)]. + +The four components are: + +- **collector** runs on each nginx host. It receives UDP datagrams from the stats plugin and + maintains in-memory ranked top-K counters across six time windows (1m, 5m, 15m, 60m, 6h, 24h). + It exposes a gRPC endpoint and rolls up its log counters into a Prometheus `/metrics` endpoint. +- **aggregator** runs on a central host. It subscribes to all collectors' snapshots via streaming + gRPC and serves a merged view using the same gRPC interface. +- **CLI** (`nginx-logtail`) allows one-off queries from the shell, against any collector or the + aggregator, and can output JSON or text. +- **frontend** is an HTTP dashboard with drilldown tables and SVG sparklines; server-rendered HTML, + zero JavaScript, because again, why not? + +The data model is a 7-tuple: `(website, client_prefix, uri, status, is_tor, asn, source_tag)`, +mapped to a 64-bit request count. Client IPs are truncated to /24 (IPv4) or /48 (IPv6) prefixes +at ingest, which keeps cardinality bounded even during DDoS events with millions of source IPs. The +`source_tag` dimension is the attribution tag from `$ipng_source_tag`, which is how the logtail +data can be filtered by maglev frontend. Isn't that cool?! + +#### Backfill on Aggregator Restart + +While developing the logtail, I noticed that restarting the aggregator would (obviously) mean losing +24 hours of historical data. To avoid this, the aggregator calls `DumpSnapshots` on each collector +at startup. Each collector streams its entire fine (1-minute) and coarse (5-minute+) ring buffer +contents back to the aggregator, which merges them into its own rings. The backfill is concurrent +across all collectors and happens before the aggregator's HTTP endpoint starts serving. From a user +perspective, an aggregator restart is invisible: the dashboard shows historical data from the full +retention window immediately, all at the expense of a few gigs of network traffic on IPng's +backbone. + +#### CLI Examples + +The CLI is a quick tool for operational triage: + +```sh +# Top 20 client prefixes by request count in the last 5 minutes. +nginx-logtail topn --target agg:9091 --window 5m --group-by prefix --n 20 + +# Which client prefixes are receiving the most 429s right now? +nginx-logtail topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 + +# Is traffic from a specific maglev frontend distributed normally across websites? +nginx-logtail topn --target agg:9091 --window 5m --group-by website --source-tag chbtl2 + +# Which URIs are generating the most 5xx responses in the last hour? +nginx-logtail topn --target agg:9091 --window 60m --group-by uri --status '>=500' + +# Show a time-series trend for errors on one website. +nginx-logtail trend --target agg:9091 --window 5m --website ipng.ch --status '>=400' +``` + +The `--status` flag accepts expressions like `429`, `>=400`, `!=200`, or `<500`. 
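+Under the hood, each of those queries is just a filter and a group-by over the 7-tuple described
+earlier, and the prefix truncation at ingest is what keeps that tractable. A minimal sketch of what
+such a counter key could look like, my own illustration rather than the repo's actual types:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/netip"
+)
+
+// counterKey is a hypothetical stand-in for the 7-tuple the collector counts on.
+type counterKey struct {
+    Website      string
+    ClientPrefix netip.Prefix // /24 for IPv4, /48 for IPv6
+    URI          string
+    Status       int
+    IsTor        bool
+    ASN          uint32
+    SourceTag    string
+}
+
+// clientPrefix truncates a client address at ingest, which keeps cardinality bounded
+// even when an attack shows up with millions of distinct source IPs.
+func clientPrefix(addr netip.Addr) netip.Prefix {
+    bits := 24
+    if addr.Is6() && !addr.Is4In6() {
+        bits = 48
+    }
+    p, _ := addr.Prefix(bits)
+    return p
+}
+
+func main() {
+    fmt.Println(clientPrefix(netip.MustParseAddr("162.19.252.246")))      // 162.19.252.0/24
+    fmt.Println(clientPrefix(netip.MustParseAddr("2a03:2880:f812:5e::"))) // 2a03:2880:f812::/48
+}
+```
+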
The +`--website-re` and `--uri-re` flags accept RE2 regex patterns. `--json` emits NDJSON for +downstream processing with `jq`. + +#### Frontend + +But who needs CLIs when you can also ask Claude to make web-frontends? The nginx-logtail frontend is +a server-rendered dashboard with no JavaScript. It uses the gRPC endpoints on collector and +aggregator to render top-K tables with inline SVG sparklines showing the request count trend per +time bucket over the last 24 hours, with drill down and filtering based on the 7-tuple above. + +{{< image width="100%" src="/assets/vpp-maglev/logtail-frontend.png" alt="nginx-logtail web frontend" >}} + +Clicking any row in the table adds it as a filter and advances to the next dimension in the +hierarchy: website, client prefix, request URI, HTTP status, ASN, source tag. A breadcrumb strip +above the table shows all active filters; clicking the `x` on any token removes just that filter. +A filter expression box accepts direct text input for filters like `status!=200 AND +website~=mon.ct.ipng.ch` above. The URL encodes the full query state so any view can be bookmarked +or shared. Requests are _quick_, the average being at around 150ms or so. It's proven so useful to +find out who is using what webservice, from where. + +### Prometheus + +Three Prometheus sources cover the system from different angles. They are designed to be used +together; each answers questions the others cannot. + +**Source 1: vpp-maglev** is the health and controlplane view. It exports backend states +(up / down / unknown / paused / disabled), effective weights per pool, and VPP API call outcomes. +This is the authoritative source for "which backends are healthy right now" and "what weight is +VPP actually using for each application server." Dashboards built here answer: _is the system +healthy?_ The [[vpp-maglev docs](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs)] +describe the full metric surface. + +**Source 2: nginx-ipng-stats-plugin** is the traffic volume view. It exports per-`(source_tag, +vip)` request and byte counters from inside nginx. The key metrics are +`nginx_ipng_requests_total` and `nginx_ipng_bytes_out_total`, both labeled `source_tag` and +`vip`, with a `code` label for status class. This layer is deliberately terse and scoped: no +per-client, no per-URI dimensions. Dashboards built here answer: _which maglev frontend is +sending how much traffic to which VIP?_ + +**Source 3: nginx-logtail** (collector) is the high-cardinality view. The collector's Prometheus +endpoint exports per-host request counters, body-size histograms, and request-time histograms, +plus per-`source_tag` rollup counters. The gRPC top-K service answers the "who and what" questions +that Prometheus alone cannot, without the cardinality risk. + +The three sources complement each other for cross-layer diagnostics: + +- If vpp-maglev shows all backends up but `nginx_ipng_requests_total` is zero for a specific + `source_tag`, the maglev frontend stopped forwarding. The BGP announcement may have been + withdrawn, or the GRE tunnel is down. +- If `nginx_ipng_requests_total` is healthy for a VIP but vpp-maglev shows a backend in down + state, the pool failover is working: traffic has moved to the standby pool, and the primary + pool is being drained. +- If vpp-maglev shows a backend as up and the stats plugin shows traffic, but error rates in + nginx-logtail are climbing, the application itself is struggling, not the load balancer. 
+
+{{< image width="100%" src="/assets/vpp-maglev/grafana-dashboard.png" alt="Grafana dashboard" >}}
+
+The Grafana dashboard combines all three sources. The top panel shows per-maglev-frontend request
+rates from `nginx_ipng_requests_total`, so I can see at a glance which of the Maglev frontends is
+busiest and whether the distribution between them looks right. Backend health state from
+vpp-maglev is overlaid as annotations: a backend going down appears as a vertical band on the
+traffic panel at the exact moment the traffic redistributed.
+
+Good observability consists of metrics and analytics as well as alerting signals. Two alerting
+rules I find particularly useful:
+
+```yaml
+groups:
+- name: maglev
+  rules:
+  - alert: NoTrafficFromMaglevFrontend
+    expr: |
+      sum by (source_tag) (
+        rate(nginx_ipng_requests_total{source_tag!="direct"}[10m])
+      ) < 1
+    for: 10m
+    annotations:
+      summary: "Maglev frontend {{ $labels.source_tag }} is sourcing de-minimis traffic"
+      description: "Check anycast announcements and GRE tunnel state for {{ $labels.source_tag }}"
+
+  - alert: NoTrafficToVIP
+    expr: |
+      sum by (vip) (
+        rate(nginx_ipng_requests_total[10m])
+      ) < 1
+    for: 10m
+    annotations:
+      summary: "VIP {{ $labels.vip }} is receiving de-minimis traffic across all sources"
+      description: "Check anycast announcements; no maglev frontend is forwarding to this VIP"
+```
+
+`NoTrafficFromMaglevFrontend` fires if a specific Maglev frontend goes silent for ten minutes, where
+silent here means less than 1.0 qps of traffic coming from it. This is distinct from a backend
+going down: it means the maglev machine itself has stopped forwarding, which is a network event
+(remember, it's always BGP!) rather than an application event.
+
+`NoTrafficToVIP` fires if a VIP receives no traffic from any Maglev frontend. This would be pretty
+bad, as `l.ipng.ch` is advertising the VIP via A/AAAA records (remember, it's always DNS!), so if
+the VIP is not receiving any traffic from any Maglev source at all, that would be a fairly serious
+situation that warrants a page.
+
+## Results
+
+The [[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)] service has been running for about
+three months now. Originally it tailed a literal logfile, watching the [[Static CT Logs](/s/ct/)]
+as they were being scraped by some abusive user. Using logfiles had the nice benefit of not needing
+any changes to nginx at all, just a bunch of repeated `access_log` statements referring to the
+custom `log_format`.
+
+Then I added the [[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)],
+which has now been running in production for about four weeks and gives a lot of very useful stats
+information. The [[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)] project is in pretty good
+shape, and has been running for about the same time (one month or so) as well.
+
+On May 1st, we celebrated Labor Day here in Switzerland. It seemed like as good a day as any to
+move all web properties at IPng Networks over to the `l.ipng.ch` VIPs, funneling all traffic
+through redundantly announced Maglev instances into redundantly connected nginx frontends. For
+about two weeks now, things have been completely fine - knock on wood - but it's safe to say
+IPng ❤️ Maglev.
+
+## What's Next
+
+The VPP routers in AS8298 and also the nginx frontends all perform `sflow` sampling, using the
+[[sFlow]({{< ref 2025-02-08-sflow-3 >}})] implementation I worked on last year.
+I'm pretty confident that, given the `sFlow` packet data and the near-real-time `logtail` request
+data, I should be able to detect abuse, DDoS, and other failure scenarios. I think my next project
+will be to create some form of nginx plugin that allows me to rate-limit or drop abusive client IP
+addresses programmatically, based on these signals. Similarly, being able to feed (very) abusive IP
+prefixes into BGP Flowspec and having them simply dropped at the VPP Maglev frontend (rather than
+forwarded to the nginx frontends) sounds like another fun thing to toy with.
+
+But for now, I'm content with the progress in IPng's web serving infrastructure.
+
diff --git a/static/assets/vpp-maglev/grafana-dashboard.png b/static/assets/vpp-maglev/grafana-dashboard.png
new file mode 100644
index 0000000..c1f5bdc
--- /dev/null
+++ b/static/assets/vpp-maglev/grafana-dashboard.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:55920b5ec615822463cfd2897657b144b99778f033765b4ce97e8da4688218d2
+size 381089
diff --git a/static/assets/vpp-maglev/logtail-frontend.png b/static/assets/vpp-maglev/logtail-frontend.png
new file mode 100644
index 0000000..ac49989
--- /dev/null
+++ b/static/assets/vpp-maglev/logtail-frontend.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5a618b33174337e9a8fb386ad689fea64ccb97d5ab9413f8d0c8bbfa045d8929
+size 138463