---
date: "2026-05-15T18:22:11Z"
title: VPP with Maglev Loadbalancing - Part 3
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
In the [[first article]({{< ref "2026-04-30-vpp-maglev" >}})] of this series, I looked at the Maglev
algorithm and how it is implemented in VPP. Then, I wrote about a health checking controller called
[[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)], and its architecture, server, client and
frontend in a [[second piece]({{< ref 2026-05-08-vpp-maglev-2 >}})]. The traffic flows in a somewhat
more complex way now from users to IPng's webservers, so this article dives into the observability
that makes this system manageable.
## Introduction
One might argue the Maglev delivery system is elegant. Traffic comes in over anycasted VIPs, and VPP
hashes each new TCP flow onto an application backend using the Maglev algorithm. The backend
receives the packet via a GRE6 tunnel and responds directly to the client. Elegant, stateless, fast.
{{< image width="10em" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}}
I would argue this is elegant right up until something explodes, and then all of a sudden the
system is just an opaque blackbox: VPP operates at the flow level and has no knowledge of individual
HTTP requests. GRE delivery means nginx sees the client IP but not which maglev sent it. Direct
Server Return means the response never passes through the original VPP Maglev instance again.
Nothing in this chain sees the full picture of a single request's journey, which will make me lose
sleep. And I love sleeping.
To close that gap, I need to build an observability layer on top of this system:
1. **vpp-maglev** itself exposes backend health state and VPP dataplane counters over Prometheus.
2. **nginx-ipng-stats-plugin** is an nginx module that attributes each request to the Maglev
frontend that delivered it, and exports per-VIP, per-frontend counters over Prometheus.
3. **nginx-logtail** is a Go pipeline that ingests per-request log lines from all nginx instances
and maintains globally ranked top-K tables across multiple time windows for high-cardinality
queries, exposed via gRPC and rollup stats, once again, over Prometheus.
Before explaining more, I need to get something off my chest. Some readers of my last article asked
why I would name the header `X-IPng-Frontend` even though it's clearly a maglev backend. It
turns out, the words "frontend" and "backend" are overloaded, pretty much in any system that has
more than two components. In a chain (VPP Maglev <-> nginx <-> docker container), nginx is both a
backend (of the maglev) and a frontend (of the docker container), so I'll use the following terms to
try to disambiguate:
- **maglev frontend**: a VPP machine that announces BGP anycast VIPs and forwards traffic to nginx
machines via GRE6 tunnels.
- **nginx frontend**: an nginx machine that receives GRE-wrapped packets, unwraps them, and proxies
requests to application containers. From Maglev's perspective it is an application server (AS);
from the web service perspective it is the front door.
- **nginx backend**: a Docker container running an application like Hugo, Nextcloud, or Immich. This
is what the user's request ultimately reaches.
IPng currently announces two IPv4/IPv6 anycasted VIPs (`vip0.l.ipng.ch` and `vip1.l.ipng.ch`) from
four maglev frontends (`chbtl2`, `frggh1`, `chlzn1`, `nlams1`) and eight nginx frontends spread
across Switzerland, France, and the Netherlands. Adding or removing Maglev and nginx instances is
non-intrusive and can be done in mere minutes with Ansible and Kees.
### Maglev: GRE Delivery
VPP delivers each packet to an nginx machine by wrapping it in a GRE6 tunnel. GRE6 uses IPv6 as the
outer header regardless of whether the inner packet is IPv4 or IPv6. The per-packet overhead is
fixed at 44 bytes: the outer IPv6 header takes 40 bytes, and the GRE header, assuming no
key/checksum, weighs in at an additional 4 bytes. A standard 1500-byte client packet wrapped in GRE6
comes out at 1544 bytes regardless of address family:
| Inner packet | Breakdown | Inner L3 size | GRE6 overhead | Wrapped total |
|---|---|---|---|---|
| IPv4 TCP | 20B IPv4 + 20B TCP + 1460B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
| IPv6 TCP | 40B IPv6 + 20B TCP + 1440B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
I decide to configure the GRE tunnel interfaces on each nginx machine with an MTU of 1544 bytes,
precisely to accommodate those 1544-byte wrapped packets without fragmentation. The 44-byte
encapsulation overhead is an internal implementation detail; from the internet's perspective,
traffic wants to flow at the standard 1500-byte MTU end to end.
That last point is why [[MSS clamping](https://en.wikipedia.org/wiki/Maximum_segment_size)] is
necessary. When nginx sends its SYN-ACK in the three-way handshake, it would normally derive its
advertised MSS from its own interface MTU. An interface MTU of 1544 would yield an MSS of 1504 bytes
(IPv4) or 1484 bytes (IPv6), telling the client it can send larger-than-standard segments. Those
segments would travel across the internet on a standard 1500-byte path and cause fragmentation or
black-holing before they even reached VPP. No bueno!
MSS clamping in the SYN-ACK overrides that interface-derived value and advertises the standard
internet MSS instead:
- IPv4 clients: 1500 - 20 (IPv4) - 20 (TCP) = **1460 bytes**
- IPv6 clients: 1500 - 40 (IPv6) - 20 (TCP) = **1440 bytes**
The client then sends standard-sized segments. Those arrive at VPP as 1500-byte packets, get
GRE6-wrapped to 1544 bytes, and arrive at the nginx GRE interface right at its 1544-byte MTU.
On the return path, nginx sends responses directly to the client via DSR at the standard 1500-byte
MTU.
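To sanity-check the arithmetic, here's a tiny Python sketch; the constants mirror the overhead table and the MSS derivation above:

```python
GRE6_OVERHEAD = 40 + 4  # outer IPv6 header + minimal GRE header (no key/checksum)

def wrapped_size(inner_l3_bytes: int) -> int:
    """Size of a client packet after GRE6 encapsulation."""
    return inner_l3_bytes + GRE6_OVERHEAD

def clamped_mss(path_mtu: int, family: int) -> int:
    """Standard-internet MSS advertised in the clamped SYN-ACK."""
    ip_hdr = 20 if family == 4 else 40
    return path_mtu - ip_hdr - 20  # minus IP header, minus TCP header

print(wrapped_size(1500))    # 1544, for IPv4 and IPv6 alike
print(clamped_mss(1500, 4))  # 1460
print(clamped_mss(1500, 6))  # 1440
```

The same function explains why clamping is needed: fed the tunnel MTU instead of the 1500-byte path MTU, it yields the oversized MSS values a naive SYN-ACK would advertise.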
On each nginx, I first allow GRE6 from the Maglev frontends, and apply the MSS clamping on the
(larger MTU) internet-facing interface `enp1s0f0`. Then, using netplan, I'll create an IP6GRE tunnel
to each Maglev frontend, using descriptive interface names which reveal who owns the remote side of
the tunnel:
```
pim@nginx0-chlzn0:~$ sudo ip6tables -A INPUT -p gre -s 2001:678:d78::/96 -j ACCEPT
pim@nginx0-chlzn0:~$ sudo ip6tables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440
pim@nginx0-chlzn0:~$ sudo iptables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
pim@nginx0-chlzn0:~$ cat /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ...
  tunnels:
    chbtl2:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::e
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    chlzn1:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::10
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    ...
```
With this, I know that traffic arriving on `chbtl2` was forwarded by the `chbtl2` maglev frontend.
That mapping is the key that makes per-source attribution possible, which is a property that I can
exploit in an nginx plugin. Clever!
### NGINX: Stats Plugin
I wrote a tiny nginx module on
[[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)] that counts requests
per VIP, split by which interface (in my case, which GRE tunnel) delivered them. It is packaged as a
standard Debian `libnginx-mod-http-ipng-stats` package on [[deb.ipng.ch](https://deb.ipng.ch/)] and
loads into stock upstream nginx without recompiling nginx itself. The design document and user guide
are in the [[docs/](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin/src/branch/main/docs)]
directory of the repo.
A tcpdump on the GRE tunnel interface confirms the handshake looks correct. Both sides settle on
the standard internet MSS values despite the larger internal MTU:
```
pim@nginx0-chlzn0:~$ sudo tcpdump -i any -n '(tcp[tcpflags] & tcp-syn) != 0 or (ip6 and ip6[6]=6 and (ip6[13+40] & 2) != 0)'
05:13:46.547826 chbtl2 In IP 162.19.252.246.39246 > 194.1.163.31.443: Flags [S], seq 2576867891,
win 64240, options [mss 1460,sackOK,TS val 767241235 ecr 0,nop,wscale 7], length 0
05:13:46.547860 enp1s0f0 Out IP 194.1.163.31.443 > 162.19.252.246.39246: Flags [S.], seq 3931956759,
ack 2576867892, win 65142, options [mss 1460,sackOK,TS val 681127624 ecr 767241235,nop,wscale 7], length 0
05:13:46.584236 chlzn1 In IP6 2a03:2880:f812:5e::.28858 > 2a0b:dd80::1:0:1.443: Flags [S], seq 36254033,
win 65535, options [mss 1380,sackOK,TS val 3022307959 ecr 0,nop,wscale 8], length 0
05:13:46.584300 enp1s0f0 Out IP6 2a0b:dd80::1:0:1.443 > 2a03:2880:f812:5e::.28858: Flags [S.], seq 3586034356,
ack 36254034, win 64482, options [mss 1440,sackOK,TS val 1977557327 ecr 3022307959,nop,wscale 7], length 0
```
The thing to look for here is that the IPv4 client from OVH was told, in the SYN-ACK going out on
`enp1s0f0`, that we are happy to take an MSS of 1460 for this IPv4 TCP connection, despite us having
a larger MTU on the interface. Similarly, the IPv6 client from Meta was told we'll accept an MSS of
1440. That's MSS clamping at work!
#### A curious case of `SO_BINDTODEVICE` versus `IP_PKTINFO`
Attributing a connection to its ingress interface sounds simple: bind each listening socket to a
specific interface with `SO_BINDTODEVICE`. Traffic arriving on `chbtl2` goes to that socket;
traffic on `nlams1` goes to another socket; attribution falls out of which socket accepted the
connection. This approach handed me my first failure :-)
The problem is that `SO_BINDTODEVICE` affects both ingress _and egress_. A socket bound to
`chbtl2` device will try to route its return traffic back through that GRE tunnel, back to the
Maglev machine. That completely breaks direct server return, which requires responses to leave via
the default gateway (the internet-facing NIC). `SO_BINDTODEVICE` on a GRE interface and DSR are
mutually exclusive. Yikes.
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
But Linux has another trick up its kernelistic sleeve: `IP_PKTINFO` (for
[[IPv4](https://man7.org/linux/man-pages/man7/ip.7.html)]) and `IPV6_RECVPKTINFO` (for
[[IPv6](https://man7.org/linux/man-pages/man7/ipv6.7.html)]). These socket options tell the kernel
to attach a control message (cmsg) to each accepted connection containing the interface index on
which the packet arrived. The listening socket itself remains a wildcard bound to no specific
interface, so outgoing packets follow the normal routing table and leave via the default gateway.
Attribution comes from reading the cmsg at connection time, not from socket binding. Whoot!
In pseudocode, reading the ingress interface from the ancillary data looks like this:
```c
// Enable pktinfo delivery on the (wildcard) listening sockets at init time.
int one = 1;
setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one));
setsockopt(fd6, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one));

// At accept time, read the ancillary data (cmsg) to find the ifindex.
char cbuf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
struct msghdr msg = { .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
recvmsg(fd, &msg, 0);

int ifindex = 0;
for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
  if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
    struct in_pktinfo *pki = (struct in_pktinfo *)CMSG_DATA(cm);
    ifindex = pki->ipi_ifindex;    /* <- the ingress interface */
  }
  if (cm->cmsg_level == IPPROTO_IPV6 && cm->cmsg_type == IPV6_PKTINFO) {
    struct in6_pktinfo *pki = (struct in6_pktinfo *)CMSG_DATA(cm);
    ifindex = pki->ipi6_ifindex;   /* <- the ingress interface */
  }
}
```
The module enables both socket options on every nginx listening socket and reads the ifindex from
the cmsg on each accepted connection. It then looks the ifindex up in a table built at
configuration time from the `device=<ifname>` parameters on the `listen` directives. The match
produces a short attribution tag. Connections that arrive on an interface with no registered
binding fall back to the configurable default tag (called `direct`), which handles unattributed
traffic like direct HTTPS connections that bypass Maglev entirely.
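For illustration, the ifindex-to-tag lookup with its `direct` fallback boils down to something like this Python sketch (the names and ifindex values here are hypothetical; the real module does this in C inside nginx):

```python
# Hypothetical sketch of the ifindex -> tag attribution table the module
# builds at configuration time from the device= parameters on 'listen'.
DEFAULT_TAG = "direct"

def build_tag_table(listens):
    """listens: iterable of (ifindex, tag) pairs from device= parameters."""
    return dict(listens)

def attribute(tag_table, ifindex):
    """Map an ingress ifindex to its source tag, falling back to 'direct'."""
    return tag_table.get(ifindex, DEFAULT_TAG)

table = build_tag_table([(7, "chbtl2"), (8, "nlams1")])
print(attribute(table, 7))   # chbtl2
print(attribute(table, 42))  # direct -- e.g. HTTPS that bypassed Maglev
```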
#### NGINX Plugin
Three things are needed in `nginx.conf`: a shared memory zone for counters, device-bound `listen`
directives that map each GRE interface to a source tag, and a scrape location:
```nginx
http {
  ipng_stats_zone ipng:4m;

  server {
    # One device-tagged listen per GRE interface per address family.
    # Each 'listen' tells the module which ifname maps to which tag.
    listen 80 device=chbtl2 ipng_source_tag=chbtl2;
    listen [::]:80 device=chbtl2 ipng_source_tag=chbtl2;
    listen 443 device=chbtl2 ipng_source_tag=chbtl2 ssl;
    listen [::]:443 device=chbtl2 ipng_source_tag=chbtl2 ssl;
    listen 80 device=nlams1 ipng_source_tag=nlams1;
    listen [::]:80 device=nlams1 ipng_source_tag=nlams1;
    listen 443 device=nlams1 ipng_source_tag=nlams1 ssl;
    listen [::]:443 device=nlams1 ipng_source_tag=nlams1 ssl;
    # ... repeat for frggh1, chlzn1 ...
  }

  # Scrape endpoint on a management port.
  server {
    listen 127.0.0.1:9113;
    location = /.well-known/ipng/statsz {
      ipng_stats;
      allow 127.0.0.1;
      deny all;
    }
  }
}
```
The cool trick is that multiple `device=` listens on the same port are not multiple kernel sockets;
under this `IP_PKTINFO` model they collapse to a single wildcard socket, and the module distinguishes
traffic by reading the ifindex from the cmsg. Adding a new VIP is a `server_name` change in nginx;
adding a new maglev frontend is a few-line append to the listens file. Neither requires a restart.
I kept the hot path intentionally minimal. On each request's log phase, the nginx worker increments
a counter in its own private table. There are no locks, no atomics, nothing fancy: just an integer
increment into memory that only this worker ever writes. A periodic timer (default one second) then
flushes the worker's private deltas into the shared memory zone using atomic adds. The scrape
handler called `ipng_stats` reads only from shared memory.
The module also registers an nginx variable `$ipng_source_tag` that resolves to the attribution
tag for the current connection. That variable is available in `log_format`, `map`, `add_header`,
and any other directive that accepts nginx variables, which is how the logtail pipeline gets the
attribution.
Scraping the endpoint confirms attribution is working. With `Accept: text/plain`, the output is
Prometheus text format; with `Accept: application/json`, it is JSON. Both support `source_tag=`
and `vip=` query parameters to filter the output:
```sh
pim@nginx0-chlzn0:~$ curl -Ss 'http://localhost:9113/.well-known/ipng/statsz?source_tag=chbtl2'
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="4xx"} 100062
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="2xx"} 14621209
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="4xx"} 6340
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="2xx"} 10339863
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="194.1.163.31"} 1599408141.000
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 2405616085.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="194.1.163.31"} 418826340291.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 47520361606.000
# ...
```
The `code` label is bucketed into six classes (`1xx`..`5xx` and `unknown`), which keeps
per-VIP cardinality bounded regardless of how many distinct HTTP status codes appear in the wild
(and trust me, they *all* occur in the wild). The module also exports request duration histograms
and upstream response time histograms, but the request/byte counters above are the day-to-day
workhorses.
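The status-class bucketing is simple enough to sketch. A Python rendition of the six-way classifier might look like this (an illustration, not the module's actual C code):

```python
def status_class(status: int) -> str:
    """Bucket an HTTP status into one of six classes: 1xx..5xx or 'unknown'."""
    if 100 <= status <= 599:
        return f"{status // 100}xx"
    return "unknown"

print(status_class(404))  # 4xx
print(status_class(200))  # 2xx
print(status_class(999))  # unknown (yes, these occur in the wild too)
```

However exotic the status codes that arrive, the label set stays at six values, so per-VIP cardinality is bounded by construction.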
### NGINX: Log Hook
Since I'm looking at all requests in the nginx log phase anyway, I thought perhaps I could go one
step further. Prometheus counters answer "how much traffic?" but not "from whom?" or "to which
URI?". Adding per-client-IP or per-URI dimensions to Prometheus would be a catastrophic idea during
a DDoS: a modest attack with one million source IPs would create one million Prometheus time series
and cause the monitoring system to be the first casualty of the incident. The C module is
deliberately narrow.
Instead, every request emits a structured log line that carries all the high-cardinality
dimensions. The format used by the stats plugin's logtail integration is:
```nginx
log_format ipng_stats_logtail
'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status'
'\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag'
'\t$server_addr\t$scheme';
```
The `v1\t` prefix is a version tag. When the format needs to evolve, a new version (say `v2`)
can be added while old emitters are still running; a reader can route each packet to the
appropriate parser by looking at the version. In case you're curious: variables like `$is_tor` come
from a map of Tor exit nodes that I maintain, and `$asn` comes from MaxMind GeoIP. Check it out
[[here](https://www.maxmind.com/en/geoip-databases)].
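A hypothetical reader for this format might route on the version tag like so (a Python sketch; the field names follow the `log_format` above, everything else is an assumption):

```python
V1_FIELDS = ["host", "remote_addr", "request_method", "request_uri",
             "status", "body_bytes_sent", "request_time", "is_tor",
             "asn", "source_tag", "server_addr", "scheme"]

def parse_line(line: str):
    """Route a datagram line to the parser matching its version tag."""
    version, _, rest = line.partition("\t")
    if version == "v1":
        values = rest.split("\t")
        if len(values) != len(V1_FIELDS):
            return None  # malformed; fire-and-forget means we just drop it
        return dict(zip(V1_FIELDS, values))
    return None  # unknown version: ignore until a parser is added

rec = parse_line("v1\tipng.ch\t192.0.2.1\tGET\t/\t200\t512\t0.004"
                 "\t0\t8298\tchbtl2\t194.1.163.31\thttps")
print(rec["source_tag"])  # chbtl2
```

When `v2` arrives, it gets its own field list and branch; old `v1` emitters keep working untouched.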
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
Adding an `access_log` directive in every `server` or `location` block would be error-prone and
would miss any newly added vhost. It would also potentially cause a lot of disk activity to log both
the `logtail` and the regular `access_log`. I decide that my stats plugin will provide an
`ipng_stats_logtail` directive at the `http` level that fires globally for every request, regardless
of which server or location handled it. Because why not?
To exclude noisy requests like health probes from the logtail stream, I add an `if=` parameter, much
like the stock `access_log` directive has, which evaluates an nginx variable at log phase and
suppresses emission when the value is empty or `0`.
A `map` block is the idiomatic way to build that variable:
```nginx
map $request_uri $logtail_skip_uri {
  ~^/\.well-known/ipng 1;
  default 0;
}
map $host $logtail_skip_host {
  maglev.ipng.ch 1;
  default 0;
}
map "$logtail_skip_uri:$logtail_skip_host" $logtail_enabled {
  "0:0" 1;
  default 0;
}

ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s if=$logtail_enabled;
```
It took me a while to get used to constructions like this. The first map matches `$request_uri`
against a regular expression and yields a string (0 or 1); the second does the same for `$host`
(literal hostnames live in an O(1) lookup hashtable, while regexes are evaluated in order). Then,
some funky boolean math concatenates these two strings in a third map, which can yield "0:0"
(neither 'skip' matched), "1:0" (the URI skip matched but the host skip did not), "0:1" (the host
matched but the URI did not), or "1:1" (both matched), once again mapping those to a string (0 or
1), which the `ipng_stats_logtail` directive can use in its `if=` argument.
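To convince myself the boolean math does what I want, I can emulate the three maps in a few lines of Python (a sketch; the regex and hostname mirror the config above):

```python
import re

def logtail_enabled(request_uri: str, host: str) -> bool:
    """Emulates the three nginx maps: skip probe URIs and one internal host."""
    skip_uri = "1" if re.match(r"^/\.well-known/ipng", request_uri) else "0"
    skip_host = "1" if host == "maglev.ipng.ch" else "0"
    return f"{skip_uri}:{skip_host}" == "0:0"  # emit only if neither skip fired

print(logtail_enabled("/index.html", "ipng.ch"))               # True
print(logtail_enabled("/.well-known/ipng/statsz", "ipng.ch"))  # False
print(logtail_enabled("/index.html", "maglev.ipng.ch"))        # False
```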
The log lines themselves are buffered in a per-worker memory buffer (64 KB by default) and transmitted as a
single UDP datagram on flush. If no receiver is listening on `127.0.0.1:9514`, the kernel will silently
discard the datagram. No blocking, no error, no disk I/O. This fire-and-forget design is just great:
analytics should never slow down a request. A lost log line is acceptable; a slow request is not.
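A minimal fire-and-forget emitter, sketched in Python (the address mirrors the config above; the real emitter lives inside the nginx worker, in C):

```python
import socket

def emit(buf: bytes, addr=("127.0.0.1", 9514)):
    """Best-effort, non-blocking UDP emit: a lost datagram is acceptable."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setblocking(False)
    try:
        s.sendto(buf, addr)  # no listener? the kernel silently drops it
    except OSError:
        pass                 # never let analytics slow down a request
    finally:
        s.close()

emit(b"v1\tipng.ch\t...")    # returns immediately, listener or not
```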
### NGINX: Logtail
But who or what consumes these UDP packets? Enter
[[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)], a four-binary Go pipeline that ingests
those log lines and answers "which client prefix is being served 429s right now?" or "which ASN is
sending me the most requests to `/ct/api` in the last 6hrs?" I'll just come right out and admit it:
this little program is 100% written and maintained by Claude Code, but I didn't deploy it sight
unseen: I reviewed every bit of the code before it went into production. The design document is in
[[docs/design.md](https://git.ipng.ch/ipng/nginx-logtail/src/branch/main/docs/design.md)].
The four components are:
- **collector** runs on each nginx host. It receives UDP datagrams from the stats plugin and
maintains in-memory ranked top-K counters across six time windows (1m, 5m, 15m, 60m, 6h, 24h).
It exposes a gRPC endpoint and rolls up its log counters into a Prometheus `/metrics` endpoint.
- **aggregator** runs on a central host. It subscribes to all collectors' snapshots via streaming
gRPC and serves a merged view using the same gRPC interface.
- **CLI** (`nginx-logtail`) allows one-off queries from the shell, against any collector or the
aggregator, and can output JSON or text.
- **frontend** is an HTTP dashboard with drilldown tables and SVG sparklines; server-rendered HTML,
zero JavaScript, because again, why not?
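The heart of a collector window is just a bounded ranked counter. A toy Python sketch of one window (the real collector is Go, with ring buffers per time window; the keys here are shortened to two dimensions for readability):

```python
from collections import Counter

class WindowTopK:
    """Sketch of one time window: count tuple keys, report the top K."""
    def __init__(self, k: int):
        self.k = k
        self.counts = Counter()

    def observe(self, key: tuple, n: int = 1):
        self.counts[key] += n

    def topk(self):
        return self.counts.most_common(self.k)

w = WindowTopK(k=2)
for prefix, n in [("192.0.2.0/24", 50), ("198.51.100.0/24", 9),
                  ("203.0.113.0/24", 30)]:
    w.observe(("ipng.ch", prefix), n)
print(w.topk())
# [(('ipng.ch', '192.0.2.0/24'), 50), (('ipng.ch', '203.0.113.0/24'), 30)]
```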
The data model is a 7-tuple: `(website, client_prefix, uri, status, is_tor, asn, source_tag)`,
mapped to a 64-bit request count. Client IPs are truncated to /24 (IPv4) or /48 (IPv6) prefixes
at ingest, which keeps cardinality bounded even during DDoS events with millions of source IPs. The
`source_tag` dimension is the attribution tag from `$ipng_source_tag`, which is how the logtail
data can be filtered by maglev frontend. Isn't that cool?!
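The ingest-time prefix truncation can be sketched with Python's `ipaddress` module (an illustration; the actual pipeline does this in Go):

```python
import ipaddress

def client_prefix(addr: str) -> str:
    """Truncate a client IP to /24 (IPv4) or /48 (IPv6) at ingest time."""
    ip = ipaddress.ip_address(addr)
    plen = 24 if ip.version == 4 else 48
    return str(ipaddress.ip_network(f"{addr}/{plen}", strict=False))

print(client_prefix("192.0.2.55"))            # 192.0.2.0/24
print(client_prefix("2a03:2880:f812:5e::1"))  # 2a03:2880:f812::/48
```

A DDoS with a million source IPs thus collapses into at most a few thousand prefixes, which is what keeps the top-K tables honest.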
#### Backfill on Aggregator Restart
While developing the logtail, I noticed that restarting the aggregator would (obviously) mean losing
24 hours of historical data. To avoid this, the aggregator calls `DumpSnapshots` on each collector
at startup. Each collector streams its entire fine (1-minute) and coarse (5-minute+) ring buffer
contents back to the aggregator, which merges them into its own rings. The backfill is concurrent
across all collectors and happens before the aggregator's HTTP endpoint starts serving. From a user
perspective, an aggregator restart is invisible: the dashboard shows historical data from the full
retention window immediately, all at the expense of a few gigs of network traffic on IPng's
backbone.
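Conceptually, the backfill merge is a per-bucket counter sum. A toy Python sketch (the bucket keys here are hypothetical epoch timestamps; the real merge runs over gRPC-streamed snapshots in Go):

```python
from collections import Counter

def merge_backfill(local_rings, collector_rings):
    """Merge a collector's dumped ring buckets into the aggregator's rings,
    summing counts per key within each time bucket."""
    for bucket, counts in collector_rings.items():
        local_rings.setdefault(bucket, Counter()).update(counts)
    return local_rings

agg = {1700000060: Counter({("ipng.ch", "192.0.2.0/24"): 5})}
dump = {1700000060: Counter({("ipng.ch", "192.0.2.0/24"): 3}),
        1700000120: Counter({("ipng.ch", "203.0.113.0/24"): 7})}
print(merge_backfill(agg, dump)[1700000060][("ipng.ch", "192.0.2.0/24")])  # 8
```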
#### CLI Examples
The CLI is a quick tool for operational triage:
```sh
# Top 20 client prefixes by request count in the last 5 minutes.
nginx-logtail topn --target agg:9091 --window 5m --group-by prefix --n 20
# Which client prefixes are receiving the most 429s right now?
nginx-logtail topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20
# Is traffic from a specific maglev frontend distributed as expected across websites?
nginx-logtail topn --target agg:9091 --window 5m --group-by website --source-tag chbtl2
# Which URIs are generating the most 5xx responses in the last hour?
nginx-logtail topn --target agg:9091 --window 60m --group-by uri --status '>=500'
# Show a time-series trend for errors on one website.
nginx-logtail trend --target agg:9091 --window 5m --website ipng.ch --status '>=400'
```
The `--status` flag accepts expressions like `429`, `>=400`, `!=200`, or `<500`. The
`--website-re` and `--uri-re` flags accept RE2 regex patterns. `--json` emits NDJSON for
downstream processing with `jq`.
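As an illustration, a matcher for those status expressions could be compiled like this (a Python sketch, not the CLI's actual Go code):

```python
import re

def status_matcher(expr: str):
    """Compile a --status expression like '429', '>=400', '!=200', '<500'."""
    m = re.match(r"^(>=|<=|!=|>|<|=)?(\d{3})$", expr)
    if not m:
        raise ValueError(f"bad status expression: {expr}")
    op, val = m.group(1) or "=", int(m.group(2))
    ops = {"=": lambda s: s == val, "!=": lambda s: s != val,
           ">": lambda s: s > val, "<": lambda s: s < val,
           ">=": lambda s: s >= val, "<=": lambda s: s <= val}
    return ops[op]

is_err = status_matcher(">=400")
print(is_err(404), is_err(200))  # True False
```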
#### Frontend
But who needs CLIs when you can also ask Claude to make web-frontends? The nginx-logtail frontend is
a server-rendered dashboard with no JavaScript. It uses the gRPC endpoints on collector and
aggregator to render top-K tables with inline SVG sparklines showing the request count trend per
time bucket over the last 24 hours, with drill down and filtering based on the 7-tuple above.
{{< image width="100%" src="/assets/vpp-maglev/logtail-frontend.png" alt="nginx-logtail web frontend" >}}
Clicking any row in the table adds it as a filter and advances to the next dimension in the
hierarchy: website, client prefix, request URI, HTTP status, ASN, source tag. A breadcrumb strip
above the table shows all active filters; clicking the `x` on any token removes just that filter.
A filter expression box accepts direct text input for filters like `status!=200 AND
website~=mon.ct.ipng.ch`. The URL encodes the full query state so any view can be bookmarked
or shared. Requests are _quick_, averaging around 150ms. It has proven very useful for finding
out who is using which webservice, from where.
### Prometheus
Three Prometheus sources cover the system from different angles. They are designed to be used
together; each answers questions the others cannot.
**Source 1: vpp-maglev** is the health and controlplane view. It exports backend states
(up / down / unknown / paused / disabled), effective weights per pool, and VPP API call outcomes.
This is the authoritative source for "which backends are healthy right now" and "what weight is
VPP actually using for each application server." Dashboards built here answer: _is the system
healthy?_ The [[vpp-maglev docs](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs)]
describe the full metric surface.
**Source 2: nginx-ipng-stats-plugin** is the traffic volume view. It exports per-`(source_tag,
vip)` request and byte counters from inside nginx. The key metrics are
`nginx_ipng_requests_total` and `nginx_ipng_bytes_out_sum`, both labeled `source_tag` and
`vip`, with a `code` label for status class. This layer is deliberately terse and scoped: no
per-client, no per-URI dimensions. Dashboards built here answer: _which maglev frontend is
sending how much traffic to which VIP?_
**Source 3: nginx-logtail** (collector) is the high-cardinality view. The collector's Prometheus
endpoint exports per-host request counters, body-size histograms, and request-time histograms,
plus per-`source_tag` rollup counters. The gRPC top-K service answers the "who and what" questions
that Prometheus alone cannot, without the cardinality risk.
The three sources complement each other for cross-layer diagnostics:
- If vpp-maglev shows all backends up but `nginx_ipng_requests_total` is zero for a specific
`source_tag`, the maglev frontend stopped forwarding. The BGP announcement may have been
withdrawn, or the GRE tunnel is down.
- If `nginx_ipng_requests_total` is healthy for a VIP but vpp-maglev shows a backend in down
state, the pool failover is working: traffic has moved to the standby pool, and the primary
pool is being drained.
- If vpp-maglev shows a backend as up and the stats plugin shows traffic, but error rates in
nginx-logtail are climbing, the application itself is struggling, not the load balancer.
{{< image width="100%" src="/assets/vpp-maglev/grafana-dashboard.png" alt="Grafana dashboard" >}}
The Grafana dashboard combines all three sources. The top panel shows per-maglev-frontend request
rates from `nginx_ipng_requests_total`, so I can see at a glance which of the Maglev frontends is
busiest and whether the distribution between them looks right. Backend health state from
vpp-maglev is overlaid as annotations: a backend going down appears as a vertical band on the
traffic panel at the exact moment the traffic redistributed.
Good observability consists of both metrics/analytics as well as signals. Two alerting rules I find
particularly useful:
```yaml
groups:
  - name: maglev
    rules:
      - alert: NoTrafficFromMaglevFrontend
        expr: |
          sum by (source_tag) (
            rate(nginx_ipng_requests_total{source_tag!="direct"}[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "Maglev frontend {{ $labels.source_tag }} is sourcing de-minimis traffic"
          description: "Check anycast announcements and GRE tunnel state for {{ $labels.source_tag }}"
      - alert: NoTrafficToVIP
        expr: |
          sum by (vip) (
            rate(nginx_ipng_requests_total[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "VIP {{ $labels.vip }} is receiving de-minimis traffic across all sources"
          description: "Check anycast announcements; no maglev frontend is forwarding to this VIP"
```
`NoTrafficFromMaglevFrontend` fires if a specific Maglev frontend goes silent for ten minutes, where
silent here means less than 1.0 qps of traffic coming from it. This is distinct from a backend
going down: it means the maglev machine itself has stopped forwarding, which is a network event
(remember, it's always BGP!) rather than an application event.
`NoTrafficToVIP` fires if a VIP receives no traffic from any Maglev frontend. This would be pretty
bad, as `l.ipng.ch` is advertising the VIP via A/AAAA records (remember, it's always DNS!), so if
the VIP is not receiving any traffic from any Maglev source at all, that would be a fairly serious
situation that warrants a page.
## Results
The [[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)] service has been running for about
three months now. Originally it tailed a literal logfile of the [[Static CT Logs](/s/ct/)] as it
was being scraped by some abusive user. Using logfiles had the nice benefit of not needing any
changes to nginx at all, just a bunch of repeated `access_log` statements referring to the custom
`log_format`.
Then I added the [[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)]
which has been running now for about four weeks in production, and gives a lot of very useful stats
information. The [[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)] project is in pretty good
shape, and has been running for about the same time (one month or so) as well.
On May 1st, we celebrated Labor Day here in Switzerland. It seemed like as good a day as any to
move all web properties at IPng Networks over to the `l.ipng.ch` VIPs, funneling all traffic through
redundantly announced Maglev instances into redundantly connected nginx frontends. For about two
weeks now, things have been completely fine - knock on wood - but it's safe to say IPng ❤️ Maglev.
## What's Next
The VPP routers in AS8298 and also the nginx frontends all perform `sflow` sampling, using the
[[sFlow]({{< ref 2025-02-08-sflow-3 >}})] implementation I worked on last year. I'm pretty confident
that, given the `sFlow` packet data and near-real-time `logtail` request data, I should be able to
detect abuse, DDoS, and other failure scenarios. I think my next project will be to create some form
of nginx plugin that allows me to rate limit or drop abusive client IP addresses programmatically,
based on these signals. Similarly, being able to feed (very) abusive IP prefixes into BGP Flowspec
and having them simply dropped at the VPP Maglev frontend (rather than forwarded to the nginx
frontends), sounds like another fun thing to toy with.
But for now, I'm content with the progress in IPng's web serving infrastructure.