diff --git a/content/articles/2026-05-15-vpp-maglev-3.md b/content/articles/2026-05-15-vpp-maglev-3.md new file mode 100644 index 0000000..8c9b043 --- /dev/null +++ b/content/articles/2026-05-15-vpp-maglev-3.md @@ -0,0 +1,555 @@ +--- +date: "2026-05-15T18:22:11Z" +title: VPP with Maglev Loadbalancing - Part 3 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +In the [[first article]({{< ref "2026-04-30-vpp-maglev" >}})] of this series, I looked at the Maglev +algorithm and how it is implemented in VPP. Then, I wrote about a health checking controller called +[[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)], and its architecture, server, client and +frontend in a [[second piece]({{< ref 2026-05-08-vpp-maglev-2 >}})]. The traffic flows in a somewhat +more complex way now from users to IPng's webservers, so this article dives into the observability +that makes this system manageable. + +## Introduction + +One might argue the Maglev delivery system is elegant. Traffic comes in over anycasted VIPs, VPP +hashes each new TCP flow onto an application backend using the Maglev algorithm. The backend +receives the packet via a GRE6 tunnel and responds directly to the client. Elegant, stateless, fast. + +{{< image width="10em" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}} + +I would argue, this is elegant right up until something explodes, and then all of a sudden the +system is just an opaque blackbox: VPP operates at the flow level and has no knowledge of individual +HTTP requests. GRE delivery means nginx sees the client IP but not which maglev sent it. Direct +Server Return means the response never passes through the original VPP Maglev instance again. +Nothing in this chain sees the full picture of a single request's journey, which will make me lose +sleep. And I love sleeping. + +To close that gap, I need to build an observability layer on top of this system: + +1. **vpp-maglev** itself exposes backend health state and VPP dataplane counters over Prometheus. +2. **nginx-ipng-stats-plugin** is an nginx module that attributes each request to the Maglev + frontend that delivered it, and exports per-VIP, per-frontend counters over Prometheus. +3. **nginx-logtail** is a Go pipeline that ingests per-request log lines from all nginx instances + and maintains globally ranked top-K tables across multiple time windows for high-cardinality + queries, exposed via gRPC and rollup stats, once again, over Prometheus. + +Before explaining more, I need to get something off my chest. Some readers asked me based on my last +article, why would I name the header `X-IPng-Frontend` even though it's clearly a maglev backend? It +turns out, the words "frontend" and "backend" are overloaded, pretty much in any system that has +more than two components. In a chain (VPP Maglev <-> nginx <-> docker container), nginx is both a +backend (of the maglev) and a frontend (of the docker container), so I'll use the following terms to +try to disambiguate: + +- **maglev frontend**: a VPP machine that announces BGP anycast VIPs and forwards traffic to nginx + machines via GRE6 tunnels. 
+- **nginx frontend**: an nginx machine that receives GRE-wrapped packets, unwraps them, and proxies
+  requests to application containers. From Maglev's perspective it is an application server (AS);
+  from the web service perspective it is the front door.
+- **nginx backend**: a Docker container running an application like Hugo, Nextcloud, or Immich. This
+  is what the user's request ultimately reaches.
+
+IPng currently announces two IPv4/IPv6 anycasted VIPs (`vip0.l.ipng.ch` and `vip1.l.ipng.ch`) from
+four maglev frontends (`chbtl2`, `frggh1`, `chlzn1`, `nlams1`) and eight nginx frontends spread
+across Switzerland, France, and the Netherlands. Adding or removing Maglev and nginx instances is
+non-intrusive and can be done in mere minutes with Ansible and Kees.
+
+### Maglev: GRE Delivery
+
+VPP delivers each packet to an nginx machine by wrapping it in a GRE6 tunnel. GRE6 uses IPv6 as the
+outer header regardless of whether the inner packet is IPv4 or IPv6. The per-packet overhead is
+fixed at 44 bytes: the outer IPv6 header takes 40 bytes, and the GRE header, assuming no
+key/checksum, weighs in at an additional 4 bytes. A standard 1500-byte client packet wrapped in GRE6
+comes out at 1544 bytes regardless of address family:
+
+| Inner packet | Breakdown | Inner L3 size | GRE6 overhead | Wrapped total |
+|---|---|---|---|---|
+| IPv4 TCP | 20B IPv4 + 20B TCP + 1460B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
+| IPv6 TCP | 40B IPv6 + 20B TCP + 1440B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
+
+To accommodate those 1544-byte wrapped packets without fragmentation, the internet-facing interface
+`enp1s0f0` on each nginx machine runs at an MTU of 2026 bytes, and the GRE tunnel interfaces
+themselves get an MTU of 1544 bytes (see the netplan config below). The 44-byte encapsulation
+overhead is an internal implementation detail; from the internet's perspective, traffic wants to
+flow at the standard 1500-byte MTU end to end.
+
+That last point is why [[MSS clamping](https://en.wikipedia.org/wiki/Maximum_segment_size)] is
+necessary. When nginx sends its SYN-ACK in the three-way handshake, it would normally derive its
+advertised MSS from the MTU of the interface the reply leaves on. An interface MTU of 2026 would
+yield an MSS of 1986 bytes (IPv4) or 1966 bytes (IPv6), telling the client it can send segments of
+almost 2000 bytes. Those segments would travel across the internet on a standard 1500-byte path and
+cause fragmentation or black-holing before they even reached VPP. No bueno!
+
+MSS clamping in the SYN-ACK overrides that interface-derived value and advertises the standard
+internet MSS instead:
+
+- IPv4 clients: 1500 - 20 (IPv4) - 20 (TCP) = **1460 bytes**
+- IPv6 clients: 1500 - 40 (IPv6) - 20 (TCP) = **1440 bytes**
+
+The client then sends standard-sized segments. Those arrive at VPP as 1500-byte packets, get
+GRE6-wrapped to 1544 bytes, and arrive at the nginx frontend well within the 2026-byte MTU of
+`enp1s0f0`. On the return path, nginx sends responses directly to the client via DSR at the
+standard 1500-byte MTU.
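+For the spreadsheet-averse, the same arithmetic as a few lines of Go. This is only a sanity check
+of the numbers above, written for this article and not taken from any of the repositories:
+
+```go
+package main
+
+import "fmt"
+
+const (
+    gre6Overhead = 40 + 4 // outer IPv6 header plus GRE header without key/checksum
+    pathMTU      = 1500   // what the internet at large carries end to end
+)
+
+func main() {
+    // A full-sized client packet, once GRE6-wrapped, needs this much room on the underlay.
+    fmt.Printf("wrapped size: %d bytes\n", pathMTU+gre6Overhead) // 1544
+
+    // The MSS values that clamping should advertise in the SYN-ACK.
+    fmt.Printf("clamped MSS IPv4: %d bytes\n", pathMTU-20-20) // 1460
+    fmt.Printf("clamped MSS IPv6: %d bytes\n", pathMTU-40-20) // 1440
+
+    // What nginx would advertise without clamping, derived from a 2026-byte interface MTU.
+    const ifMTU = 2026
+    fmt.Printf("unclamped MSS IPv4: %d bytes\n", ifMTU-20-20) // 1986
+    fmt.Printf("unclamped MSS IPv6: %d bytes\n", ifMTU-40-20) // 1966
+}
+```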
+On each nginx, I first allow GRE6 from the Maglev frontends, and apply the MSS clamping on the
+(larger MTU) internet-facing interface `enp1s0f0`. Then, using netplan, I'll create an IP6GRE
+tunnel to each Maglev frontend, using descriptive interface names which reveal who owns the remote
+side of the tunnel:
+
+```
+pim@nginx0-chlzn0:~$ sudo ip6tables -A INPUT -p gre -s 2001:678:d78::/96 -j ACCEPT
+pim@nginx0-chlzn0:~$ sudo ip6tables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440
+pim@nginx0-chlzn0:~$ sudo iptables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
+pim@nginx0-chlzn0:~$ cat /etc/netplan/01-netcfg.yaml
+network:
+  version: 2
+  renderer: networkd
+  ethernets:
+  ...
+  tunnels:
+    chbtl2:
+      mode: ip6gre
+      mtu: 1544
+      local: 2001:678:d78:f::2:0
+      remote: 2001:678:d78::e
+      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
+    chlzn1:
+      mode: ip6gre
+      mtu: 1544
+      local: 2001:678:d78:f::2:0
+      remote: 2001:678:d78::10
+      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
+  ...
+```
+
+A tcpdump on the nginx frontend confirms the handshake looks correct. Both sides settle on the
+standard internet MSS values despite the larger internal MTU:
+
+```
+pim@nginx0-chlzn0:~$ sudo tcpdump -i any -n '(tcp[tcpflags] & tcp-syn) != 0 or (ip6 and ip6[6]=6 and (ip6[13+40] & 2) != 0)'
+05:13:46.547826 chbtl2 In IP 162.19.252.246.39246 > 194.1.163.31.443: Flags [S], seq 2576867891,
+    win 64240, options [mss 1460,sackOK,TS val 767241235 ecr 0,nop,wscale 7], length 0
+05:13:46.547860 enp1s0f0 Out IP 194.1.163.31.443 > 162.19.252.246.39246: Flags [S.], seq 3931956759,
+    ack 2576867892, win 65142, options [mss 1460,sackOK,TS val 681127624 ecr 767241235,nop,wscale 7], length 0
+
+05:13:46.584236 chlzn1 In IP6 2a03:2880:f812:5e::.28858 > 2a0b:dd80::1:0:1.443: Flags [S], seq 36254033,
+    win 65535, options [mss 1380,sackOK,TS val 3022307959 ecr 0,nop,wscale 8], length 0
+05:13:46.584300 enp1s0f0 Out IP6 2a0b:dd80::1:0:1.443 > 2a03:2880:f812:5e::.28858: Flags [S.], seq 3586034356,
+    ack 36254034, win 64482, options [mss 1440,sackOK,TS val 1977557327 ecr 3022307959,nop,wscale 7], length 0
+```
+
+The thing to look for here is that the IPv4 client at OVH was told, in the SYN-ACK going out on
+`enp1s0f0`, that we are happy to take an MSS of 1460 for this IPv4 TCP connection, despite our
+having a larger MTU on that interface. Similarly, the IPv6 client at Meta was told we'll accept an
+MSS of 1440. That's MSS clamping at work!
+
+With this, I also know that traffic arriving on `chbtl2` was forwarded by the `chbtl2` maglev
+frontend. That mapping is the key that makes per-source attribution possible, which is a property
+that I can exploit in an nginx plugin. Clever!
+
+### NGINX: Stats Plugin
+
+I wrote a tiny nginx module on
+[[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)] that counts requests
+per VIP, split by which interface (in my case, which GRE tunnel) delivered them. It is packaged as a
+standard Debian `libnginx-mod-http-ipng-stats` package on [[deb.ipng.ch](https://deb.ipng.ch/)] and
+loads into stock upstream nginx without recompiling nginx itself. The design document and user guide
+are in the [[docs/](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin/src/branch/main/docs)]
+directory of the repo.
+
+#### A curious case of `SO_BINDTODEVICE` versus `IP_PKTINFO`
+
+Attributing a connection to its ingress interface sounds simple: bind each listening socket to a
+specific interface with `SO_BINDTODEVICE`.
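+In a Go-flavoured sketch (the module itself is C, and this is my illustration of the idea rather
+than its code), that per-interface binding would look something like:
+
+```go
+package main
+
+import (
+    "context"
+    "log"
+    "net"
+    "syscall"
+
+    "golang.org/x/sys/unix"
+)
+
+// listenOnDevice opens a TCP listener whose socket is pinned to one interface
+// with SO_BINDTODEVICE: the approach that turns out to break DSR.
+func listenOnDevice(addr, device string) (net.Listener, error) {
+    lc := net.ListenConfig{
+        Control: func(network, address string, c syscall.RawConn) error {
+            var serr error
+            if err := c.Control(func(fd uintptr) {
+                serr = unix.SetsockoptString(int(fd), unix.SOL_SOCKET, unix.SO_BINDTODEVICE, device)
+            }); err != nil {
+                return err
+            }
+            return serr
+        },
+    }
+    return lc.Listen(context.Background(), "tcp", addr)
+}
+
+func main() {
+    // One listener per maglev frontend tunnel; whichever listener accepts a
+    // connection tells us which frontend delivered it.
+    ln, err := listenOnDevice("194.1.163.31:443", "chbtl2")
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer ln.Close()
+}
+```
+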
Traffic arriving on `chbtl2` goes to that socket; +traffic on `nlams1` goes to another socket; attribution falls out of which socket accepted the +connection. And this approach signaled my first failure :-) + +The problem is that `SO_BINDTODEVICE` affects both ingress _and egress_. A socket bound to +`chbtl2` device will try to route its return traffic back through that GRE tunnel, back to the +Maglev machine. That completely breaks direct server return, which requires responses to leave via +the default gateway (the internet-facing NIC). `SO_BINDTODEVICE` on a GRE interface and DSR are +mutually exclusive. Yikes. + +{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}} + +But Linux has another trick up its kernelistic sleeve: `IP_PKTINFO` (for +[[IPv4](https://man7.org/linux/man-pages/man7/ip.7.html)]) and `IPV6_RECVPKTINFO` (for +[[IPv6](https://man7.org/linux/man-pages/man7/ipv6.7.html)]). These socket options tell the kernel +to attach a control message (cmsg) to each accepted connection containing the interface index on +which the packet arrived. The listening socket itself remains a wildcard bound to no specific +interface, so outgoing packets follow the normal routing table and leave via the default gateway. +Attribution comes from reading the cmsg at connection time, not from socket binding. Whoot! + +In pseudocode, reading the ingress interface from the ancillary data looks like this: + +```c +// Enable IP_PKTINFO on the socket at init time. +setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one)); +setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one)); + +// At accept time, read the cmsg to find the ifindex. +struct msghdr msg = { ... }; +recvmsg(fd, &msg, 0); +for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) { + struct in_pktinfo *pki = (struct in_pktinfo *)CMSG_DATA(cm); + ifindex = pki->ipi_ifindex; // <- the ingress interface + } + if (cm->cmsg_level == IPPROTO_IPV6 && cm->cmsg_type == IPV6_PKTINFO) { + struct in6_pktinfo *pki = (struct in6_pktinfo *)CMSG_DATA(cm); + ifindex = pki->ipi6_ifindex; // <- the ingress interface + } +} +``` + +The module enables both socket options on every nginx listening socket and reads the ifindex from +the cmsg on each accepted connection. It then looks the ifindex up in a table built at +configuration time from the `device=` parameters on the `listen` directives. The match +produces a short attribution tag. Connections that arrive on an interface with no registered +binding fall back to the configurable default tag (called `direct`), which handles unattributed +traffic like direct HTTPS connections that bypass Maglev entirely. + +#### NGINX Plugin + +Three things are needed in `nginx.conf`: a shared memory zone for counters, device-bound `listen` +directives that map each GRE interface to a source tag, and a scrape location: + +```nginx +http { + ipng_stats_zone ipng:4m; + + server { + # One device-tagged listen per GRE interface per address family. + # Each 'listen' tells the module which ifname maps to which tag. 
+ listen 80 device=chbtl2 ipng_source_tag=chbtl2; + listen [::]:80 device=chbtl2 ipng_source_tag=chbtl2; + listen 443 device=chbtl2 ipng_source_tag=chbtl2 ssl; + listen [::]:443 device=chbtl2 ipng_source_tag=chbtl2 ssl; + listen 80 device=nlams1 ipng_source_tag=nlams1; + listen [::]:80 device=nlams1 ipng_source_tag=nlams1; + listen 443 device=nlams1 ipng_source_tag=nlams1 ssl; + listen [::]:443 device=nlams1 ipng_source_tag=nlams1 ssl; + # ... repeat for frggh1, chlzn1 ... + } + + # Scrape endpoint on a management port. + server { + listen 127.0.0.1:9113; + location = /.well-known/ipng/statsz { + ipng_stats; + allow 127.0.0.1; + deny all; + } + } +} +``` + +The cool trick is, that multiple `device=` listens on the same port are not multiple kernel sockets; +under this `IP_PKTINFO` model they collapse to a single wildcard socket, and the module distinguishes +traffic by reading the ifindex from the cmsg. Adding a new VIP is a `server_name` change in nginx; +adding a new maglev frontend is a two-line append to the listens file. Neither requires a restart. + +I kept the hot path intentionally minimal. On each request's log phase, the nginx worker increments +a counter in its own private table. There are no locks; no atomics; nothing fancy. Just an integer +increment into memory that only this worker ever writes. A periodic timer (default one second) then +flushes the worker's private deltas into the shared memory zone using atomic adds. The scrape +handler called `ipng_stats` reads only from shared memory. + +The module also registers an nginx variable `$ipng_source_tag` that resolves to the attribution +tag for the current connection. That variable is available in `log_format`, `map`, `add_header`, +and any other directive that accepts nginx variables, which is how the logtail pipeline gets the +attribution. + +Scraping the endpoint confirms attribution is working. With `Accept: text/plain`, the output is +Prometheus text format; with `Accept: application/json`, it is JSON. Both support `source_tag=` +and `vip=` query parameters to filter the output: + +```sh +pim@nginx0-chlzn0:~$ curl -Ss http://localhost:9113/.well-known/ipng/statsz?source_tag=chbtl2 +nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="4xx"} 100062 +nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="2xx"} 14621209 +nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="4xx"} 6340 +nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="2xx"} 10339863 +nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="194.1.163.31"} 1599408141.000 +nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 2405616085.000 +nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="194.1.163.31"} 418826340291.000 +nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 47520361606.000 +# ... +``` + +The `code` label is bucketed into six classes (`1xx`..`5xx` and `unknown`), which keeps +per-VIP cardinality bounded regardless of how many distinct HTTP status codes appear in the wild +(and trust me, they *all* occur in the wild). The module also exports request duration histograms +and upstream response time histograms, but the request/byte counters above are the day-to-day +workhorses. + +### NGINX: Log Hook + +Since I'm looking at all requests in the nginx log phase anyway, I thought perhaps I could go one +step further. Prometheus counters answer "how much traffic?" but not "from whom?" or "to which +URI?". 
Adding per-client-IP or per-URI dimensions to Prometheus would be a catastrophic idea during a
+DDoS: a modest attack with one million source IPs would create one million Prometheus time series
+and cause the monitoring system to be the first casualty of the incident. The C module is
+deliberately narrow.
+
+Instead, every request emits a structured log line that carries all the high-cardinality
+dimensions. The format used by the stats plugin's logtail integration is:
+
+```nginx
+log_format ipng_stats_logtail
+    'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status'
+    '\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag'
+    '\t$server_addr\t$scheme';
+```
+
+The `v1\t` prefix is a version tag. When the format needs to evolve, a new version (say `v2`)
+can be added while old emitters are still running; a reader can route each packet to the
+appropriate parser by looking at the version. In case you're curious, variables like `$is_tor`
+come from a map I maintain of Tor exit nodes, and `$asn` comes from Maxmind GeoIP. Check it out
+[[here](https://www.maxmind.com/en/geoip-databases)].
+
+{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
+
+Adding an `access_log` directive in every `server` or `location` block would be error-prone and
+would miss any newly added vhost. It would also potentially cause a lot of disk activity, writing
+both the `logtail` format and the regular `access_log`. I decide that my stats plugin will provide
+an `ipng_stats_logtail` directive at the `http` level that fires globally for every request,
+regardless of which server or location handled it. Because why not?
+
+To exclude noisy requests like health probes from the logtail stream, I add an `if=` parameter
+(mirroring the one on `access_log`) which evaluates an nginx variable at log phase and suppresses
+emission when the value is empty or `0`. A `map` block is the idiomatic way to build that variable:
+
+```nginx
+map $request_uri $logtail_skip_uri {
+    ~^/\.well-known/ipng 1;
+    default 0;
+}
+
+map $host $logtail_skip_host {
+    maglev.ipng.ch 1;
+    default 0;
+}
+
+map "$logtail_skip_uri:$logtail_skip_host" $logtail_enabled {
+    "0:0" 1;
+    default 0;
+}
+
+ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s if=$logtail_enabled;
+```
+
+It took me a while to get used to constructions like this. The first map matches `$request_uri`
+against a regular expression and yields a string (0 or 1); the second does the same for `$host`
+(exact hostnames are a hash lookup, regexes are checked in order). Then, some funky boolean math
+allows me to concatenate these two strings in a new map, which can yield "0:0" (neither 'skip'
+matched), "1:0" (the URI skip matched but the host skip did not), "0:1" (the URI did not match but
+the host did), or "1:1" (both matched), once again mapping those to a string (0 or 1), which the
+`ipng_stats_logtail` directive can use in its `if=` argument.
+
+The log lines themselves are buffered in a per-worker memory buffer (64 KB by default) and
+transmitted as a single UDP datagram on flush. If no receiver is listening on `127.0.0.1:9514`, the
+kernel will silently discard the datagram. No blocking, no error, no disk I/O. This fire-and-forget
+design is just great: analytics should never slow down a request. A lost log line is acceptable; a
+slow request is not.
+
+### NGINX: Logtail
+
+But who or what consumes these UDP packets?
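+During development, before any real consumer existed, a few lines of Go were enough to eyeball the
+stream. A throwaway debugging listener (my own, not part of any of the repositories mentioned here)
+is all it takes:
+
+```go
+package main
+
+import (
+    "fmt"
+    "log"
+    "net"
+    "strings"
+)
+
+func main() {
+    // Listen where the ipng_stats_logtail directive above sends its datagrams.
+    conn, err := net.ListenPacket("udp", "127.0.0.1:9514")
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer conn.Close()
+
+    buf := make([]byte, 65535)
+    for {
+        n, _, err := conn.ReadFrom(buf)
+        if err != nil {
+            log.Fatal(err)
+        }
+        // One datagram carries a whole flush buffer: many tab-separated 'v1' log lines.
+        for _, line := range strings.Split(strings.TrimRight(string(buf[:n]), "\n"), "\n") {
+            if fields := strings.Split(line, "\t"); len(fields) > 0 && fields[0] == "v1" {
+                fmt.Println(line)
+            }
+        }
+    }
+}
+```
+
+In production, though, something considerably more capable picks them up.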
Enter +[[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)], a four-binary Go pipeline that ingests +those log lines and answers "which client prefix is being served 429s right now?" or "which ASN is +sending me most requests to `/ct/api` in the last 6hrs?" I'll just come right out and admit it: this +little program is 100% written and maintained by Claude Code, and I had no hesitation deploying it; +I reviewed every bit of the code before it went into production. The design document is in +[[docs/design.md](https://git.ipng.ch/ipng/nginx-logtail/src/branch/main/docs/design.md)]. + +The four components are: + +- **collector** runs on each nginx host. It receives UDP datagrams from the stats plugin and + maintains in-memory ranked top-K counters across six time windows (1m, 5m, 15m, 60m, 6h, 24h). + It exposes a gRPC endpoint and rolls up its log counters into a Prometheus `/metrics` endpoint. +- **aggregator** runs on a central host. It subscribes to all collectors' snapshots via streaming + gRPC and serves a merged view using the same gRPC interface. +- **CLI** (`nginx-logtail`) allows one-off queries from the shell, against any collector or the + aggregator, and can output JSON or text. +- **frontend** is an HTTP dashboard with drilldown tables and SVG sparklines; server-rendered HTML, + zero JavaScript, because again, why not? + +The data model is a 7-tuple: `(website, client_prefix, uri, status, is_tor, asn, source_tag)`, +mapped to a 64-bit request count. Client IPs are truncated to /24 (IPv4) or /48 (IPv6) prefixes +at ingest, which keeps cardinality bounded even during DDoS events with millions of source IPs. The +`source_tag` dimension is the attribution tag from `$ipng_source_tag`, which is how the logtail +data can be filtered by maglev frontend. Isn't that cool?! + +#### Backfill on Aggregator Restart + +While developing the logtail, I noticed that restarting the aggregator would (obviously) mean losing +24 hours of historical data. To avoid this, the aggregator calls `DumpSnapshots` on each collector +at startup. Each collector streams its entire fine (1-minute) and coarse (5-minute+) ring buffer +contents back to the aggregator, which merges them into its own rings. The backfill is concurrent +across all collectors and happens before the aggregator's HTTP endpoint starts serving. From a user +perspective, an aggregator restart is invisible: the dashboard shows historical data from the full +retention window immediately, all at the expense of a few gigs of network traffic on IPng's +backbone. + +#### CLI Examples + +The CLI is a quick tool for operational triage: + +```sh +# Top 20 client prefixes by request count in the last 5 minutes. +nginx-logtail topn --target agg:9091 --window 5m --group-by prefix --n 20 + +# Which client prefixes are receiving the most 429s right now? +nginx-logtail topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20 + +# Is traffic from a specific maglev frontend distributed normally across websites? +nginx-logtail topn --target agg:9091 --window 5m --group-by website --source-tag chbtl2 + +# Which URIs are generating the most 5xx responses in the last hour? +nginx-logtail topn --target agg:9091 --window 60m --group-by uri --status '>=500' + +# Show a time-series trend for errors on one website. +nginx-logtail trend --target agg:9091 --window 5m --website ipng.ch --status '>=400' +``` + +The `--status` flag accepts expressions like `429`, `>=400`, `!=200`, or `<500`. 
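+Under the hood, each of those queries is just a filter and a group-by over the 7-tuple described
+earlier, and the prefix truncation at ingest is what keeps that tractable. A minimal sketch of what
+such a counter key could look like, my own illustration rather than the repo's actual types:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/netip"
+)
+
+// counterKey is a hypothetical stand-in for the 7-tuple the collector counts on.
+type counterKey struct {
+    Website      string
+    ClientPrefix netip.Prefix // /24 for IPv4, /48 for IPv6
+    URI          string
+    Status       int
+    IsTor        bool
+    ASN          uint32
+    SourceTag    string
+}
+
+// clientPrefix truncates a client address at ingest, which keeps cardinality bounded
+// even when an attack shows up with millions of distinct source IPs.
+func clientPrefix(addr netip.Addr) netip.Prefix {
+    bits := 24
+    if addr.Is6() && !addr.Is4In6() {
+        bits = 48
+    }
+    p, _ := addr.Prefix(bits)
+    return p
+}
+
+func main() {
+    fmt.Println(clientPrefix(netip.MustParseAddr("162.19.252.246")))      // 162.19.252.0/24
+    fmt.Println(clientPrefix(netip.MustParseAddr("2a03:2880:f812:5e::"))) // 2a03:2880:f812::/48
+}
+```
+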
The +`--website-re` and `--uri-re` flags accept RE2 regex patterns. `--json` emits NDJSON for +downstream processing with `jq`. + +#### Frontend + +But who needs CLIs when you can also ask Claude to make web-frontends? The nginx-logtail frontend is +a server-rendered dashboard with no JavaScript. It uses the gRPC endpoints on collector and +aggregator to render top-K tables with inline SVG sparklines showing the request count trend per +time bucket over the last 24 hours, with drill down and filtering based on the 7-tuple above. + +{{< image width="100%" src="/assets/vpp-maglev/logtail-frontend.png" alt="nginx-logtail web frontend" >}} + +Clicking any row in the table adds it as a filter and advances to the next dimension in the +hierarchy: website, client prefix, request URI, HTTP status, ASN, source tag. A breadcrumb strip +above the table shows all active filters; clicking the `x` on any token removes just that filter. +A filter expression box accepts direct text input for filters like `status!=200 AND +website~=mon.ct.ipng.ch` above. The URL encodes the full query state so any view can be bookmarked +or shared. Requests are _quick_, the average being at around 150ms or so. It's proven so useful to +find out who is using what webservice, from where. + +### Prometheus + +Three Prometheus sources cover the system from different angles. They are designed to be used +together; each answers questions the others cannot. + +**Source 1: vpp-maglev** is the health and controlplane view. It exports backend states +(up / down / unknown / paused / disabled), effective weights per pool, and VPP API call outcomes. +This is the authoritative source for "which backends are healthy right now" and "what weight is +VPP actually using for each application server." Dashboards built here answer: _is the system +healthy?_ The [[vpp-maglev docs](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs)] +describe the full metric surface. + +**Source 2: nginx-ipng-stats-plugin** is the traffic volume view. It exports per-`(source_tag, +vip)` request and byte counters from inside nginx. The key metrics are +`nginx_ipng_requests_total` and `nginx_ipng_bytes_out_total`, both labeled `source_tag` and +`vip`, with a `code` label for status class. This layer is deliberately terse and scoped: no +per-client, no per-URI dimensions. Dashboards built here answer: _which maglev frontend is +sending how much traffic to which VIP?_ + +**Source 3: nginx-logtail** (collector) is the high-cardinality view. The collector's Prometheus +endpoint exports per-host request counters, body-size histograms, and request-time histograms, +plus per-`source_tag` rollup counters. The gRPC top-K service answers the "who and what" questions +that Prometheus alone cannot, without the cardinality risk. + +The three sources complement each other for cross-layer diagnostics: + +- If vpp-maglev shows all backends up but `nginx_ipng_requests_total` is zero for a specific + `source_tag`, the maglev frontend stopped forwarding. The BGP announcement may have been + withdrawn, or the GRE tunnel is down. +- If `nginx_ipng_requests_total` is healthy for a VIP but vpp-maglev shows a backend in down + state, the pool failover is working: traffic has moved to the standby pool, and the primary + pool is being drained. +- If vpp-maglev shows a backend as up and the stats plugin shows traffic, but error rates in + nginx-logtail are climbing, the application itself is struggling, not the load balancer. 
+
+{{< image width="100%" src="/assets/vpp-maglev/grafana-dashboard.png" alt="Grafana dashboard" >}}
+
+The Grafana dashboard combines all three sources. The top panel shows per-maglev-frontend request
+rates from `nginx_ipng_requests_total`, so I can see at a glance which of the Maglev frontends is
+busiest and whether the distribution between them looks right. Backend health state from
+vpp-maglev is overlaid as annotations: a backend going down appears as a vertical band on the
+traffic panel at the exact moment the traffic redistributed.
+
+Good observability consists of metrics and analytics as well as alerting signals. Two alerting
+rules I find particularly useful:
+
+```yaml
+groups:
+- name: maglev
+  rules:
+  - alert: NoTrafficFromMaglevFrontend
+    expr: |
+      sum by (source_tag) (
+        rate(nginx_ipng_requests_total{source_tag!="direct"}[10m])
+      ) < 1
+    for: 10m
+    annotations:
+      summary: "Maglev frontend {{ $labels.source_tag }} is sourcing de-minimis traffic"
+      description: "Check anycast announcements and GRE tunnel state for {{ $labels.source_tag }}"
+
+  - alert: NoTrafficToVIP
+    expr: |
+      sum by (vip) (
+        rate(nginx_ipng_requests_total[10m])
+      ) < 1
+    for: 10m
+    annotations:
+      summary: "VIP {{ $labels.vip }} is receiving de-minimis traffic across all sources"
+      description: "Check anycast announcements; no maglev frontend is forwarding to this VIP"
+```
+
+`NoTrafficFromMaglevFrontend` fires if a specific Maglev frontend goes silent for ten minutes, where
+silent here means less than 1.0 qps of traffic coming from it. This is distinct from a backend
+going down: it means the maglev machine itself has stopped forwarding, which is a network event
+(remember, it's always BGP!) rather than an application event.
+
+`NoTrafficToVIP` fires if a VIP receives no traffic from any Maglev frontend. This would be pretty
+bad, as `l.ipng.ch` is advertising the VIP via A/AAAA records (remember, it's always DNS!), so if
+the VIP is not receiving any traffic from any Maglev source at all, that would be a fairly serious
+situation that warrants a page.
+
+## Results
+
+The [[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)] service has been running for about
+three months now. Originally it tailed a literal logfile, watching the [[Static CT Logs](/s/ct/)]
+as they were being scraped by some abusive user. Using logfiles had the nice benefit of not needing
+any changes to nginx at all, just a bunch of repeated `access_log` statements referring to the
+custom `log_format`.
+
+Then I added the [[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)],
+which has now been running in production for about four weeks and gives a lot of very useful stats
+information. The [[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)] project is in pretty good
+shape, and has been running for about the same time (one month or so) as well.
+
+On May 1st, we celebrated Labor Day here in Switzerland. It seemed like as good a day as any to
+move all web properties at IPng Networks over to the `l.ipng.ch` VIPs, funneling all traffic
+through redundantly announced Maglev instances into redundantly connected nginx frontends. For
+about two weeks now, things have been completely fine - knock on wood - but it's safe to say
+IPng ❤️ Maglev.
+
+## What's Next
+
+The VPP routers in AS8298 and also the nginx frontends all perform `sflow` sampling, using the
+[[sFlow]({{< ref 2025-02-08-sflow-3 >}})] implementation I worked on last year.
+I'm pretty confident that, given the `sFlow` packet data and the near-real-time `logtail` request
+data, I should be able to detect abuse, DDoS, and other failure scenarios. I think my next project
+will be to create some form of nginx plugin that allows me to rate-limit or drop abusive client IP
+addresses programmatically, based on these signals. Similarly, being able to feed (very) abusive IP
+prefixes into BGP Flowspec and having them simply dropped at the VPP Maglev frontend (rather than
+forwarded to the nginx frontends) sounds like another fun thing to toy with.
+
+But for now, I'm content with the progress in IPng's web serving infrastructure.
+
diff --git a/static/assets/vpp-maglev/grafana-dashboard.png b/static/assets/vpp-maglev/grafana-dashboard.png
new file mode 100644
index 0000000..c1f5bdc
--- /dev/null
+++ b/static/assets/vpp-maglev/grafana-dashboard.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:55920b5ec615822463cfd2897657b144b99778f033765b4ce97e8da4688218d2
+size 381089
diff --git a/static/assets/vpp-maglev/logtail-frontend.png b/static/assets/vpp-maglev/logtail-frontend.png
new file mode 100644
index 0000000..ac49989
--- /dev/null
+++ b/static/assets/vpp-maglev/logtail-frontend.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5a618b33174337e9a8fb386ad689fea64ccb97d5ab9413f8d0c8bbfa045d8929
+size 138463