---
date: "2026-05-15T18:22:11Z"
title: VPP with Maglev Loadbalancing - Part 3
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar, as many of the approaches
are shared between the two.

In the [[first article]({{< ref "2026-04-30-vpp-maglev" >}})] of this series, I looked at the Maglev
algorithm and how it is implemented in VPP. Then, I wrote about a health checking controller called
[[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)] and its architecture - server, client and
frontend - in a [[second piece]({{< ref 2026-05-08-vpp-maglev-2 >}})]. Traffic now flows in a somewhat
more complex way from users to IPng's webservers, so this article dives into the observability
that makes this system manageable.

## Introduction

One might argue that the Maglev delivery system is elegant. Traffic comes in over anycasted VIPs, and
VPP hashes each new TCP flow onto an application backend using the Maglev algorithm. The backend
receives the packet via a GRE6 tunnel and responds directly to the client. Elegant, stateless, fast.

{{< image width="10em" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}}

I would argue that this is elegant right up until something explodes, and then all of a sudden the
system is just an opaque black box: VPP operates at the flow level and has no knowledge of individual
HTTP requests. GRE delivery means nginx sees the client IP but not which maglev sent it. Direct
Server Return means the response never passes through the original VPP Maglev instance again.
Nothing in this chain sees the full picture of a single request's journey, which will make me lose
sleep. And I love sleeping.

To close that gap, I need to build an observability layer on top of this system:

1. **vpp-maglev** itself exposes backend health state and VPP dataplane counters over Prometheus.
2. **nginx-ipng-stats-plugin** is an nginx module that attributes each request to the Maglev
   frontend that delivered it, and exports per-VIP, per-frontend counters over Prometheus.
3. **nginx-logtail** is a Go pipeline that ingests per-request log lines from all nginx instances
   and maintains globally ranked top-K tables across multiple time windows for high-cardinality
   queries, exposed via gRPC and, once again, as rollup stats over Prometheus.

Before explaining more, I need to get something off my chest. Some readers asked me, based on my last
article, why I would name the header `X-IPng-Frontend` even though it's clearly a maglev backend. It
turns out the words "frontend" and "backend" are overloaded in pretty much any system that has
more than two components. In a chain (VPP Maglev <-> nginx <-> docker container), nginx is both a
backend (of the maglev) and a frontend (of the docker container), so I'll use the following terms to
try to disambiguate:

- **maglev frontend**: a VPP machine that announces BGP anycast VIPs and forwards traffic to nginx
  machines via GRE6 tunnels.
- **nginx frontend**: an nginx machine that receives GRE-wrapped packets, unwraps them, and proxies
  requests to application containers. From Maglev's perspective it is an application server (AS);
  from the web service perspective it is the front door.
- **nginx backend**: a Docker container running an application like Hugo, Nextcloud, or Immich. This
  is what the user's request ultimately reaches.

IPng currently announces two IPv4/IPv6 anycasted VIPs (`vip0.l.ipng.ch` and `vip1.l.ipng.ch`) from
four maglev frontends (`chbtl2`, `frggh1`, `chlzn1`, `nlams1`) and eight nginx frontends spread
across Switzerland, France, and the Netherlands. Adding or removing Maglev instances and nginx
instances is non-intrusive and can be done in mere minutes with Ansible and Kees.

### Maglev: GRE Delivery

VPP delivers each packet to an nginx machine by wrapping it in a GRE6 tunnel. GRE6 uses IPv6 as the
outer header regardless of whether the inner packet is IPv4 or IPv6. The per-packet overhead is
fixed at 44 bytes: the outer IPv6 header takes 40 bytes, and the GRE header, assuming no
key/checksum, weighs in at an additional 4 bytes. A standard 1500-byte client packet wrapped in GRE6
comes out at 1544 bytes regardless of address family:

| Inner packet | Breakdown | Inner L3 size | GRE6 overhead | Wrapped total |
|---|---|---|---|---|
| IPv4 TCP | 20B IPv4 + 20B TCP + 1460B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |
| IPv6 TCP | 40B IPv6 + 20B TCP + 1440B payload | 1500 bytes | 40+4 bytes | **1544 bytes** |

I decide to configure the GRE tunnel interfaces on each nginx machine with an MTU of 2026 bytes,
precisely to accommodate those 1544-byte wrapped packets without fragmentation. The 44-byte
encapsulation overhead is an internal implementation detail; from the internet's perspective,
traffic wants to flow at the standard 1500-byte MTU end to end.

That last point is why [[MSS clamping](https://en.wikipedia.org/wiki/Maximum_segment_size)] is
necessary. When nginx sends its SYN-ACK in the three-way handshake, it would normally derive its
advertised MSS from its own interface MTU. An interface MTU of 2026 would yield an MSS of 1986 bytes
(IPv4) or 1966 bytes (IPv6), telling the client it can send segments of almost 2000 bytes. Those
segments would travel across the internet on a standard 1500-byte path and cause fragmentation or
black-holing before they even reached VPP. No bueno!

MSS clamping in the SYN-ACK overrides that interface-derived value and advertises the standard
internet MSS instead:

- IPv4 clients: 1500 - 20 (IPv4) - 20 (TCP) = **1460 bytes**
- IPv6 clients: 1500 - 40 (IPv6) - 20 (TCP) = **1440 bytes**

The client then sends standard-sized segments. Those arrive at VPP as 1500-byte packets, get
GRE6-wrapped to 1544 bytes, and arrive at the nginx GRE interface well within its 2026-byte MTU.
On the return path, nginx sends responses directly to the client via DSR at the standard 1500-byte
MTU.
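
By the way, a quick way to convince yourself that the clamped 1500-byte path really holds end to end
is to ping a VIP with the don't-fragment bit set and a payload that lands on exactly 1500 bytes on
the wire. A sketch, using standard `iputils` ping flags against `vip0.l.ipng.ch` from above:

```sh
# 1472B ICMP payload + 8B ICMP header + 20B IPv4 header = 1500B on the wire; -M do sets DF.
ping -4 -M do -s 1472 -c 3 vip0.l.ipng.ch

# 1452B payload + 8B ICMPv6 header + 40B IPv6 header = 1500B on the wire.
ping -6 -M do -s 1452 -c 3 vip0.l.ipng.ch
```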

On each nginx, I first allow GRE6 from the Maglev frontends, and apply the MSS clamping on the
(larger MTU) internet-facing interface `enp1s0f0`. Then, using netplan, I'll create an IP6GRE tunnel
to each Maglev frontend, using descriptive interface names which reveal who owns the remote side of
the tunnel:

```
pim@nginx0-chlzn0:~$ sudo ip6tables -A INPUT -p gre -s 2001:678:d78::/96 -j ACCEPT
pim@nginx0-chlzn0:~$ sudo ip6tables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440
pim@nginx0-chlzn0:~$ sudo iptables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
pim@nginx0-chlzn0:~$ cat /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ...
  tunnels:
    chbtl2:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::e
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    chlzn1:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::10
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    ...
```
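
Once netplan has applied this, a few quick commands confirm the tunnels and the clamping rules are
in place. Nothing exotic, just iproute2 and the mangle table as configured above:

```sh
# The ip6gre tunnel should show the configured endpoints and MTU.
ip -d link show chbtl2

# The VIP addresses should be bound to the tunnel interface.
ip addr show dev chbtl2

# The TCPMSS rules should appear in the mangle table, with packet counters ticking up.
sudo iptables -t mangle -L POSTROUTING -v -n
sudo ip6tables -t mangle -L POSTROUTING -v -n
```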

With this, I know that traffic arriving on `chbtl2` was forwarded by the `chbtl2` maglev frontend.
That mapping is the key that makes per-source attribution possible, which is a property that I can
exploit in an nginx plugin. Clever!

### NGINX: Stats Plugin

I wrote a tiny nginx module,
[[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)], that counts requests
per VIP, split by which interface (in my case, which GRE tunnel) delivered them. It is packaged as a
standard Debian `libnginx-mod-http-ipng-stats` package on [[deb.ipng.ch](https://deb.ipng.ch/)] and
loads into stock upstream nginx without recompiling nginx itself. The design document and user guide
are in the [[docs/](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin/src/branch/main/docs)]
directory of the repo.

A tcpdump on the GRE tunnel interface confirms the handshake looks correct. Both sides settle on
the standard internet MSS values despite the larger internal MTU:

```
pim@nginx0-chlzn0:~$ sudo tcpdump -i any -n '(tcp[tcpflags] & tcp-syn) != 0 or (ip6 and ip6[6]=6 and (ip6[13+40] & 2) != 0)'
05:13:46.547826 chbtl2 In IP 162.19.252.246.39246 > 194.1.163.31.443: Flags [S], seq 2576867891,
    win 64240, options [mss 1460,sackOK,TS val 767241235 ecr 0,nop,wscale 7], length 0
05:13:46.547860 enp1s0f0 Out IP 194.1.163.31.443 > 162.19.252.246.39246: Flags [S.], seq 3931956759,
    ack 2576867892, win 65142, options [mss 1460,sackOK,TS val 681127624 ecr 767241235,nop,wscale 7], length 0

05:13:46.584236 chlzn1 In IP6 2a03:2880:f812:5e::.28858 > 2a0b:dd80::1:0:1.443: Flags [S], seq 36254033,
    win 65535, options [mss 1380,sackOK,TS val 3022307959 ecr 0,nop,wscale 8], length 0
05:13:46.584300 enp1s0f0 Out IP6 2a0b:dd80::1:0:1.443 > 2a03:2880:f812:5e::.28858: Flags [S.], seq 3586034356,
    ack 36254034, win 64482, options [mss 1440,sackOK,TS val 1977557327 ecr 3022307959,nop,wscale 7], length 0
```

The thing to look for here is that the IPv4 client from OVH was told, in the packet going out on
`enp1s0f0`, that we are happy to take an MSS of 1460 for this IPv4 TCP connection, despite us having a
larger MTU on the interface. Similarly, the IPv6 client from Meta was told we'll accept an MSS of
1440. That's MSS clamping at work!

#### A curious case of `SO_BINDTODEVICE` versus `IP_PKTINFO`

Attributing a connection to its ingress interface sounds simple: bind each listening socket to a
specific interface with `SO_BINDTODEVICE`. Traffic arriving on `chbtl2` goes to that socket;
traffic on `nlams1` goes to another socket; attribution falls out of which socket accepted the
connection. And this approach signaled my first failure :-)

The problem is that `SO_BINDTODEVICE` affects both ingress _and egress_. A socket bound to the
`chbtl2` device will try to route its return traffic back through that GRE tunnel, back to the
Maglev machine. That completely breaks direct server return, which requires responses to leave via
the default gateway (the internet-facing NIC). `SO_BINDTODEVICE` on a GRE interface and DSR are
mutually exclusive. Yikes.

{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

But Linux has another trick up its kernelistic sleeve: `IP_PKTINFO` (for
[[IPv4](https://man7.org/linux/man-pages/man7/ip.7.html)]) and `IPV6_RECVPKTINFO` (for
[[IPv6](https://man7.org/linux/man-pages/man7/ipv6.7.html)]). These socket options tell the kernel
to attach a control message (cmsg) to each accepted connection containing the interface index on
which the packet arrived. The listening socket itself remains a wildcard bound to no specific
interface, so outgoing packets follow the normal routing table and leave via the default gateway.
Attribution comes from reading the cmsg at connection time, not from socket binding. Whoot!

In pseudocode, reading the ingress interface from the ancillary data looks like this:

```c
// Enable IP_PKTINFO / IPV6_RECVPKTINFO on the listening socket at init time.
setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one));
setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one));

// At accept time, read the cmsg to find the ifindex.
struct msghdr msg = { ... };
recvmsg(fd, &msg, 0);
for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
  if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
    struct in_pktinfo *pki = (struct in_pktinfo *)CMSG_DATA(cm);
    ifindex = pki->ipi_ifindex;    // <- the ingress interface
  }
  if (cm->cmsg_level == IPPROTO_IPV6 && cm->cmsg_type == IPV6_PKTINFO) {
    struct in6_pktinfo *pki = (struct in6_pktinfo *)CMSG_DATA(cm);
    ifindex = pki->ipi6_ifindex;   // <- the ingress interface
  }
}
```

The module enables both socket options on every nginx listening socket and reads the ifindex from
the cmsg on each accepted connection. It then looks the ifindex up in a table built at
configuration time from the `device=<ifname>` parameters on the `listen` directives. The match
produces a short attribution tag. Connections that arrive on an interface with no registered
binding fall back to the configurable default tag (called `direct`), which handles unattributed
traffic like direct HTTPS connections that bypass Maglev entirely.
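
To give a feel for it, here's a minimal sketch of what that configuration-time table could look
like; the helper names are hypothetical, and the real module's structures live in the repo linked
above:

```c
#include <net/if.h>     // if_nametoindex()
#include <string.h>

#define MAX_TAGS 32

// Hypothetical table mapping a kernel ifindex to an attribution tag.
static struct { unsigned ifindex; char tag[16]; } tag_table[MAX_TAGS];
static int tag_count;

// Called at configuration time for each `listen ... device=<ifname> ipng_source_tag=<tag>`.
static int register_device_tag(const char *ifname, const char *tag) {
    unsigned idx = if_nametoindex(ifname);   // returns 0 if the interface does not exist
    if (idx == 0 || tag_count == MAX_TAGS)
        return -1;
    tag_table[tag_count].ifindex = idx;
    strncpy(tag_table[tag_count].tag, tag, sizeof(tag_table[0].tag) - 1);
    tag_count++;
    return 0;
}

// Called per accepted connection with the ifindex read from the cmsg.
static const char *lookup_tag(unsigned ifindex) {
    for (int i = 0; i < tag_count; i++)
        if (tag_table[i].ifindex == ifindex)
            return tag_table[i].tag;
    return "direct";   // fallback tag for unattributed traffic
}
```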

#### NGINX Plugin

Three things are needed in `nginx.conf`: a shared memory zone for counters, device-bound `listen`
directives that map each GRE interface to a source tag, and a scrape location:

```nginx
http {
  ipng_stats_zone ipng:4m;

  server {
    # One device-tagged listen per GRE interface per address family.
    # Each 'listen' tells the module which ifname maps to which tag.
    listen 80 device=chbtl2 ipng_source_tag=chbtl2;
    listen [::]:80 device=chbtl2 ipng_source_tag=chbtl2;
    listen 443 device=chbtl2 ipng_source_tag=chbtl2 ssl;
    listen [::]:443 device=chbtl2 ipng_source_tag=chbtl2 ssl;
    listen 80 device=nlams1 ipng_source_tag=nlams1;
    listen [::]:80 device=nlams1 ipng_source_tag=nlams1;
    listen 443 device=nlams1 ipng_source_tag=nlams1 ssl;
    listen [::]:443 device=nlams1 ipng_source_tag=nlams1 ssl;
    # ... repeat for frggh1, chlzn1 ...
  }

  # Scrape endpoint on a management port.
  server {
    listen 127.0.0.1:9113;
    location = /.well-known/ipng/statsz {
      ipng_stats;
      allow 127.0.0.1;
      deny all;
    }
  }
}
```

The cool trick is that multiple `device=` listens on the same port are not multiple kernel sockets;
under this `IP_PKTINFO` model they collapse to a single wildcard socket, and the module distinguishes
traffic by reading the ifindex from the cmsg. Adding a new VIP is a `server_name` change in nginx;
adding a new maglev frontend is a two-line append to the listens file. Neither requires a restart.

I kept the hot path intentionally minimal. On each request's log phase, the nginx worker increments
a counter in its own private table. There are no locks, no atomics, nothing fancy. Just an integer
increment into memory that only this worker ever writes. A periodic timer (default one second) then
flushes the worker's private deltas into the shared memory zone using atomic adds. The scrape
handler, `ipng_stats`, reads only from shared memory.
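
As a rough sketch of that pattern, with hypothetical names (the real shared zone is nginx slab
memory rather than a plain array):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NCOUNTERS 1024

// Per-worker private deltas: only this worker writes them, so plain increments suffice.
static uint64_t local_delta[NCOUNTERS];

// Counters in the shared zone, mapped into every worker; written only with atomics.
static _Atomic uint64_t *shared_counters;   // points into the shared memory zone

// Hot path: called in the log phase of every request.
static inline void count_request(int slot) {
    local_delta[slot]++;                     // no locks, no atomics
}

// Cold path: called from the periodic (default 1s) timer in each worker.
static void flush_deltas(void) {
    for (int i = 0; i < NCOUNTERS; i++) {
        if (local_delta[i] == 0) continue;
        atomic_fetch_add_explicit(&shared_counters[i], local_delta[i],
                                  memory_order_relaxed);
        local_delta[i] = 0;
    }
}
```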

The module also registers an nginx variable `$ipng_source_tag` that resolves to the attribution
tag for the current connection. That variable is available in `log_format`, `map`, `add_header`,
and any other directive that accepts nginx variables, which is how the logtail pipeline gets the
attribution.
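
For example, a hedged sketch of wiring the variable into a response header and an access log; the
header name echoes the `X-IPng-Frontend` discussion above, and the exact directives here are
illustrative rather than lifted from production:

```nginx
# Expose which maglev frontend delivered the request - handy when debugging from a browser.
add_header X-IPng-Frontend $ipng_source_tag always;

# Or fold it into a classic access log format.
log_format attributed '$remote_addr [$time_local] "$request" $status via=$ipng_source_tag';
access_log /var/log/nginx/attributed.log attributed;
```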

Scraping the endpoint confirms attribution is working. With `Accept: text/plain`, the output is
Prometheus text format; with `Accept: application/json`, it is JSON. Both support `source_tag=`
and `vip=` query parameters to filter the output:

```sh
pim@nginx0-chlzn0:~$ curl -Ss http://localhost:9113/.well-known/ipng/statsz?source_tag=chbtl2
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="4xx"} 100062
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="2xx"} 14621209
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="4xx"} 6340
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="2xx"} 10339863
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="194.1.163.31"} 1599408141.000
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 2405616085.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="194.1.163.31"} 418826340291.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 47520361606.000
# ...
```

The `code` label is bucketed into six classes (`1xx`..`5xx` and `unknown`), which keeps
per-VIP cardinality bounded regardless of how many distinct HTTP status codes appear in the wild
(and trust me, they *all* occur in the wild). The module also exports request duration histograms
and upstream response time histograms, but the request/byte counters above are the day-to-day
workhorses.
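
On the Prometheus side, scraping it is a bog-standard job. A sketch, assuming the stats endpoint is
reachable from the Prometheus host (in my case it listens on localhost, so a node-local scraper or a
firewalled listen address would be needed; the target below is just a placeholder):

```yaml
scrape_configs:
  - job_name: 'nginx-ipng-stats'
    metrics_path: '/.well-known/ipng/statsz'
    static_configs:
      - targets:
          - 'nginx0-chlzn0:9113'   # placeholder; one target per nginx frontend
```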

### NGINX: Log Hook

Since I'm looking at all requests in the nginx log phase anyway, I thought perhaps I could go one
step further. Prometheus counters answer "how much traffic?" but not "from whom?" or "to which
URI?". Adding per-client-IP or per-URI dimensions to Prometheus would be a catastrophic idea during
a DDoS: a modest attack with one million source IPs would create one million Prometheus time series
and cause the monitoring system to be the first casualty of the incident. The C module is
deliberately narrow by design.

Instead, every request emits a structured log line that carries all the high-cardinality
dimensions. The format used by the stats plugin's logtail integration is:

```nginx
log_format ipng_stats_logtail
    'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status'
    '\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag'
    '\t$server_addr\t$scheme';
```

The `v1\t` prefix is a version tag. When the format needs to evolve, a new version (say `v2`)
can be added while old emitters are still running; a reader can route each packet to the
appropriate parser by looking at the version. In case you're curious, variables like `$is_tor`
come from a map I maintain of Tor exit nodes, and `$asn` comes from MaxMind GeoIP. Check it out
[[here](https://www.maxmind.com/en/geoip-databases)].

{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Adding an `access_log` directive in every `server` or `location` block would be error-prone and
would miss any newly added vhost. It would also potentially cause a lot of disk activity to log both
the `logtail` and the regular `access_log`. I decide that my stats plugin will provide an
`ipng_stats_logtail` directive at the `http` level that fires globally for every request, regardless
of which server or location handled it. Because why not?

To exclude noisy requests like health probes from the logtail stream, I add an `if=` parameter
which evaluates an nginx variable at log phase and suppresses emission when the value is empty or `0`.
A `map` block is the idiomatic way to build that variable:

```nginx
map $request_uri $logtail_skip_uri {
  ~^/\.well-known/ipng  1;
  default               0;
}

map $host $logtail_skip_host {
  maglev.ipng.ch  1;
  default         0;
}

map "$logtail_skip_uri:$logtail_skip_host" $logtail_enabled {
  "0:0"    1;
  default  0;
}

ipng_stats_logtail ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s if=$logtail_enabled;
```

It took me a while to get used to constructions like this. The first map maps a regular expression
on the `$request_uri` to a string (0 or 1) in an O(1) lookup hashtable. The second map does the same
for the `$host` (or a regexp on the host). Then, some funky boolean math lets me concatenate
these two strings in a new map, which can yield "0:0" (neither 'skip' matched), "1:0" (the URI
skip matched, but the host skip did not), "0:1" (the URI did not match, but the host did), or "1:1"
(both 'skip's matched), once again mapping those to a string (0 or 1), which the `ipng_stats_logtail`
directive can use in its `if=` argument.

The log lines themselves are buffered in a per-worker memory buffer (64 KB by default) and transmitted as a
single UDP datagram on flush. If no receiver is listening on `127.0.0.1:9514`, the kernel will silently
discard the datagram. No blocking, no error, no disk I/O. This fire-and-forget design is just great:
analytics should never slow down a request. A lost log line is acceptable; a slow request is not.
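
During bring-up it's handy to eyeball the raw stream before any collector exists; a throwaway
listener on that port does the trick (assuming `socat` is installed):

```sh
# Print every tab-separated v1 log line as it arrives on UDP 9514 (Ctrl-C to stop).
socat -u UDP4-RECVFROM:9514,fork STDOUT
```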

### NGINX: Logtail

But who or what consumes these UDP packets? Enter
[[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)], a four-binary Go pipeline that ingests
those log lines and answers questions like "which client prefix is being served 429s right now?" or
"which ASN is sending me the most requests to `/ct/api` in the last 6hrs?" I'll just come right out
and admit it: this little program is 100% written and maintained by Claude Code, and I had no
hesitation deploying it: I reviewed every bit of the code before it went into production. The design
document is in
[[docs/design.md](https://git.ipng.ch/ipng/nginx-logtail/src/branch/main/docs/design.md)].

The four components are:

- **collector** runs on each nginx host. It receives UDP datagrams from the stats plugin and
  maintains in-memory ranked top-K counters across six time windows (1m, 5m, 15m, 60m, 6h, 24h).
  It exposes a gRPC endpoint and rolls up its log counters into a Prometheus `/metrics` endpoint.
- **aggregator** runs on a central host. It subscribes to all collectors' snapshots via streaming
  gRPC and serves a merged view using the same gRPC interface.
- **CLI** (`nginx-logtail`) allows one-off queries from the shell, against any collector or the
  aggregator, and can output JSON or text.
- **frontend** is an HTTP dashboard with drilldown tables and SVG sparklines; server-rendered HTML,
  zero JavaScript, because again, why not?

The data model is a 7-tuple: `(website, client_prefix, uri, status, is_tor, asn, source_tag)`,
mapped to a 64-bit request count. Client IPs are truncated to /24 (IPv4) or /48 (IPv6) prefixes
at ingest, which keeps cardinality bounded even during DDoS events with millions of source IPs. The
`source_tag` dimension is the attribution tag from `$ipng_source_tag`, which is how the logtail
data can be filtered by maglev frontend. Isn't that cool?!
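
To make the truncation concrete, here's a small sketch in Go using the standard library's
`net/netip`; the struct and field names are my guess at the shape, not the collector's actual types:

```go
package main

import (
	"fmt"
	"net/netip"
)

// Key mirrors the 7-tuple described above; hypothetical field names.
type Key struct {
	Website      string
	ClientPrefix netip.Prefix
	URI          string
	Status       int
	IsTor        bool
	ASN          uint32
	SourceTag    string
}

// truncate collapses a client address to /24 (IPv4) or /48 (IPv6) at ingest time.
func truncate(addr netip.Addr) netip.Prefix {
	bits := 48
	if addr.Is4() {
		bits = 24
	}
	p, _ := addr.Prefix(bits) // error only if bits is invalid for the address family
	return p
}

func main() {
	a := netip.MustParseAddr("203.0.113.77")
	b := netip.MustParseAddr("2001:db8:cafe:42::1")
	fmt.Println(truncate(a), truncate(b)) // 203.0.113.0/24 2001:db8:cafe::/48
}
```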

#### Backfill on Aggregator Restart

While developing the logtail, I noticed that restarting the aggregator would (obviously) mean losing
24 hours of historical data. To avoid this, the aggregator calls `DumpSnapshots` on each collector
at startup. Each collector streams its entire fine (1-minute) and coarse (5-minute+) ring buffer
contents back to the aggregator, which merges them into its own rings. The backfill is concurrent
across all collectors and happens before the aggregator's HTTP endpoint starts serving. From a user
perspective, an aggregator restart is invisible: the dashboard shows historical data from the full
retention window immediately, all at the expense of a few gigs of network traffic on IPng's
backbone.

#### CLI Examples

The CLI is a quick tool for operational triage:

```sh
# Top 20 client prefixes by request count in the last 5 minutes.
nginx-logtail topn --target agg:9091 --window 5m --group-by prefix --n 20

# Which client prefixes are receiving the most 429s right now?
nginx-logtail topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20

# Is traffic from a specific maglev frontend distributed normally across websites?
nginx-logtail topn --target agg:9091 --window 5m --group-by website --source-tag chbtl2

# Which URIs are generating the most 5xx responses in the last hour?
nginx-logtail topn --target agg:9091 --window 60m --group-by uri --status '>=500'

# Show a time-series trend for errors on one website.
nginx-logtail trend --target agg:9091 --window 5m --website ipng.ch --status '>=400'
```

The `--status` flag accepts expressions like `429`, `>=400`, `!=200`, or `<500`. The
`--website-re` and `--uri-re` flags accept RE2 regex patterns. `--json` emits NDJSON for
downstream processing with `jq`.
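
For example, a quick pipe to turn the NDJSON into a CSV report; the `.key` and `.count` field names
are an assumption about the output schema, not gospel:

```sh
# Top offender prefixes as CSV: prefix, request count.
nginx-logtail topn --target agg:9091 --window 15m --group-by prefix --status '>=400' --json \
  | jq -r '[.key, .count] | @csv'
```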

#### Frontend

But who needs CLIs when you can also ask Claude to make web frontends? The nginx-logtail frontend is
a server-rendered dashboard with no JavaScript. It uses the gRPC endpoints on collector and
aggregator to render top-K tables with inline SVG sparklines showing the request count trend per
time bucket over the last 24 hours, with drill-down and filtering based on the 7-tuple above.

{{< image width="100%" src="/assets/vpp-maglev/logtail-frontend.png" alt="nginx-logtail web frontend" >}}

Clicking any row in the table adds it as a filter and advances to the next dimension in the
hierarchy: website, client prefix, request URI, HTTP status, ASN, source tag. A breadcrumb strip
above the table shows all active filters; clicking the `x` on any token removes just that filter.
A filter expression box accepts direct text input for filters like `status!=200 AND
website~=mon.ct.ipng.ch`. The URL encodes the full query state, so any view can be bookmarked
or shared. Requests are _quick_, averaging around 150ms. The frontend has proven very useful for
finding out who is using which webservice, from where.

### Prometheus

Three Prometheus sources cover the system from different angles. They are designed to be used
together; each answers questions the others cannot.

**Source 1: vpp-maglev** is the health and controlplane view. It exports backend states
(up / down / unknown / paused / disabled), effective weights per pool, and VPP API call outcomes.
This is the authoritative source for "which backends are healthy right now" and "what weight is
VPP actually using for each application server." Dashboards built here answer: _is the system
healthy?_ The [[vpp-maglev docs](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs)]
describe the full metric surface.

**Source 2: nginx-ipng-stats-plugin** is the traffic volume view. It exports per-`(source_tag,
vip)` request and byte counters from inside nginx. The key metrics are
`nginx_ipng_requests_total` and `nginx_ipng_bytes_out_sum`, both labeled with `source_tag` and
`vip`, the former also with a `code` label for status class. This layer is deliberately terse and
scoped: no per-client, no per-URI dimensions. Dashboards built here answer: _which maglev frontend
is sending how much traffic to which VIP?_

**Source 3: nginx-logtail** (collector) is the high-cardinality view. The collector's Prometheus
endpoint exports per-host request counters, body-size histograms, and request-time histograms,
plus per-`source_tag` rollup counters. The gRPC top-K service answers the "who and what" questions
that Prometheus alone cannot, without the cardinality risk.

The three sources complement each other for cross-layer diagnostics:

- If vpp-maglev shows all backends up but `nginx_ipng_requests_total` is zero for a specific
  `source_tag`, the maglev frontend stopped forwarding. The BGP announcement may have been
  withdrawn, or the GRE tunnel is down.
- If `nginx_ipng_requests_total` is healthy for a VIP but vpp-maglev shows a backend in down
  state, the pool failover is working: traffic has moved to the standby pool, and the primary
  pool is being drained.
- If vpp-maglev shows a backend as up and the stats plugin shows traffic, but error rates in
  nginx-logtail are climbing, the application itself is struggling, not the load balancer.

{{< image width="100%" src="/assets/vpp-maglev/grafana-dashboard.png" alt="Grafana dashboard" >}}

The Grafana dashboard combines all three sources. The top panel shows per-maglev-frontend request
rates from `nginx_ipng_requests_total`, so I can see at a glance which of the Maglev frontends is
busiest and whether the distribution between them looks right. Backend health state from
vpp-maglev is overlaid as annotations: a backend going down appears as a vertical band on the
traffic panel at the exact moment the traffic redistributed.

Good observability consists of both metrics/analytics and signals. Two alerting rules I find
particularly useful:

```yaml
groups:
  - name: maglev
    rules:
      - alert: NoTrafficFromMaglevFrontend
        expr: |
          sum by (source_tag) (
            rate(nginx_ipng_requests_total{source_tag!="direct"}[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "Maglev frontend {{ $labels.source_tag }} is sourcing de-minimis traffic"
          description: "Check anycast announcements and GRE tunnel state for {{ $labels.source_tag }}"

      - alert: NoTrafficToVIP
        expr: |
          sum by (vip) (
            rate(nginx_ipng_requests_total[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "VIP {{ $labels.vip }} is receiving de-minimis traffic from any source"
          description: "Check anycast announcements; no maglev frontend is forwarding to this VIP"
```

`NoTrafficFromMaglevFrontend` fires if a specific Maglev frontend goes silent for ten minutes, where
silent here means less than 1.0 qps of traffic coming from it. This is distinct from a backend
going down: it means the maglev machine itself has stopped forwarding, which is a network event
(remember, it's always BGP!) rather than an application event.

`NoTrafficToVIP` fires if a VIP receives no traffic from any Maglev frontend. This would be pretty
bad, as `l.ipng.ch` is advertising the VIP via A/AAAA records (remember, it's always DNS!), so if
the VIP is not receiving any traffic from any Maglev source at all, that would be a fairly serious
situation that warrants a page.

## Results

The [[nginx-logtail](https://git.ipng.ch/ipng/nginx-logtail)] service has been running for about
three months now. Originally it literally tailed a logfile, following the [[Static CT Logs](/s/ct/)] as they
were being scraped by some abusive user. Using logfiles had the nice benefit of not needing any
changes to nginx at all, just a bunch of repeated `access_log` statements referring to the custom
`log_format`.

Then I added the [[nginx-ipng-stats-plugin](https://git.ipng.ch/ipng/nginx-ipng-stats-plugin)],
which has been running in production for about four weeks now, and gives a lot of very useful stats
information. The [[vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev)] project is in pretty good
shape, and has been running for about the same time (one month or so) as well.

On May 1st, we celebrated Labour Day here in Switzerland. It seemed like as good a day as any to
move all web properties at IPng Networks over to the `l.ipng.ch` VIPs, funneling all traffic through
redundantly announced Maglev instances into redundantly connected nginx frontends. For about two
weeks now, things have been completely fine - knock on wood - and it's safe to say IPng ❤️ Maglev.

## What's Next

The VPP routers in AS8298, as well as the nginx frontends, all perform `sFlow` sampling, using the
[[sFlow]({{< ref 2025-02-08-sflow-3 >}})] implementation I worked on last year. I'm pretty confident
that, given the `sFlow` packet data and near real-time `logtail` request data, I should be able to
detect abuse, DDoS, and other failure scenarios. I think my next project will be to create some form
of nginx plugin that allows me to rate limit or drop abusive client IP addresses programmatically,
based on these signals. Similarly, being able to feed (very) abusive IP prefixes into BGP Flowspec
and having them simply dropped at the VPP Maglev frontend (rather than forwarded to the nginx
frontends) sounds like another fun thing to toy with.

But for now, I'm content with the progress in IPng's web serving infrastructure.