---
date: "2026-05-15T18:22:11Z"
title: "VPP with Maglev Loadbalancing - Part 3"
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
## About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.
In the [[first article]({{< ref "2026-04-30-vpp-maglev" >}})] of this series, I looked at the Maglev algorithm and how it is implemented in VPP. Then, in a [[second piece]({{< ref "2026-05-08-vpp-maglev-2" >}})], I wrote about a health checking controller called [vpp-maglev]: its architecture, server, client and frontend. Traffic now flows from users to IPng's webservers in a somewhat more complex way, so this article dives into the observability that makes the system manageable.
## Introduction
One might argue the Maglev delivery system is elegant. Traffic comes in over anycasted VIPs, and VPP hashes each new TCP flow onto an application backend using the Maglev algorithm. The backend receives the packet via a GRE6 tunnel and responds directly to the client. Elegant, stateless, fast.
{{< image width="10em" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}}
I would argue this is elegant right up until something explodes, and then all of a sudden the system is just an opaque black box: VPP operates at the flow level and has no knowledge of individual HTTP requests. GRE delivery means nginx sees the client IP but not which maglev sent it. Direct Server Return means the response never passes through the original VPP Maglev instance again. Nothing in this chain sees the full picture of a single request's journey, which will make me lose sleep. And I love sleeping.
To close that gap, I need to build an observability layer on top of this system:
- vpp-maglev itself exposes backend health state and VPP dataplane counters over Prometheus.
- nginx-ipng-stats-plugin is an nginx module that attributes each request to the Maglev frontend that delivered it, and exports per-VIP, per-frontend counters over Prometheus.
- nginx-logtail is a Go pipeline that ingests per-request log lines from all nginx instances and maintains globally ranked top-K tables across multiple time windows for high-cardinality queries, exposed via gRPC, with rollup stats exported, once again, over Prometheus.
Before explaining more, I need to get something off my chest. Based on my last article, some readers
asked why I would name the header X-IPng-Frontend when it is clearly a maglev backend. It turns out
that the words "frontend" and "backend" are overloaded in pretty much any system that has more than
two components. In a chain (VPP Maglev <-> nginx <-> docker container), nginx is both a backend (of
the maglev) and a frontend (of the docker container), so I'll use the following terms to try to
disambiguate:
- maglev frontend: a VPP machine that announces BGP anycast VIPs and forwards traffic to nginx machines via GRE6 tunnels.
- nginx frontend: an nginx machine that receives GRE-wrapped packets, unwraps them, and proxies requests to application containers. From Maglev's perspective it is an application server (AS); from the web service perspective it is the front door.
- nginx backend: a Docker container running an application like Hugo, Nextcloud, or Immich. This is what the user's request ultimately reaches.
IPng currently announces two IPv4/IPv6 anycasted VIPs (vip0.l.ipng.ch and vip1.l.ipng.ch) from
four maglev frontends (chbtl2, frggh1, chlzn1, nlams1), which forward to eight nginx frontends
spread across Switzerland, France, and the Netherlands. Adding or removing Maglev and nginx
instances is non-intrusive and can be done in mere minutes with Ansible and Kees.
## Maglev: GRE Delivery
VPP delivers each packet to an nginx machine by wrapping it in a GRE6 tunnel. GRE6 uses IPv6 as the outer header regardless of whether the inner packet is IPv4 or IPv6. The per-packet overhead is fixed at 44 bytes: the outer IPv6 header takes 40 bytes, and the GRE header, assuming no key or checksum, weighs in at an additional 4 bytes. A standard 1500-byte client packet wrapped in GRE6 comes out at 1544 bytes regardless of address family:
| Inner packet | Breakdown | Inner L3 size | GRE6 overhead | Wrapped total |
|---|---|---|---|---|
| IPv4 TCP | 20B IPv4 + 20B TCP + 1460B payload | 1500 bytes | 40+4 bytes | 1544 bytes |
| IPv6 TCP | 40B IPv6 + 20B TCP + 1440B payload | 1500 bytes | 40+4 bytes | 1544 bytes |
I decide to configure the GRE tunnel interfaces on each nginx machine with an MTU of 2026 bytes, comfortably large enough to accommodate those 1544-byte wrapped packets without fragmentation. The 44-byte encapsulation overhead is an internal implementation detail; from the internet's perspective, traffic wants to flow at the standard 1500-byte MTU end to end.
That last point is why [MSS clamping] is necessary. When nginx sends its SYN-ACK in the three-way handshake, it would normally derive its advertised MSS from its own interface MTU. An interface MTU of 2026 would yield an MSS of 1986 bytes (IPv4) or 1966 bytes (IPv6), telling the client it can send segments of almost 2000 bytes. Those segments would travel across the internet on a standard 1500-byte path and cause fragmentation or black-holing before they even reached VPP. No bueno!
MSS clamping in the SYN-ACK overrides that interface-derived value and advertises the standard internet MSS instead:
- IPv4 clients: 1500 - 20 (IPv4) - 20 (TCP) = 1460 bytes
- IPv6 clients: 1500 - 40 (IPv6) - 20 (TCP) = 1440 bytes
The client then sends standard-sized segments. Those arrive at VPP as 1500-byte packets, get GRE6-wrapped to 1544 bytes, and arrive at the nginx GRE interface well within its 2026-byte MTU. On the return path, nginx sends responses directly to the client via DSR at the standard 1500-byte MTU.
On each nginx, I first allow GRE6 from the Maglev frontends, and apply the MSS clamping on the
(larger MTU) internet-facing interface enp1s0f0. Then, using netplan, I'll create an IP6GRE tunnel
to each Maglev frontend, using descriptive interface names which reveal who owns the remote side of
the tunnel:
```
pim@nginx0-chlzn0:~$ sudo ip6tables -A INPUT -p gre -s 2001:678:d78::/96 -j ACCEPT
pim@nginx0-chlzn0:~$ sudo ip6tables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440
pim@nginx0-chlzn0:~$ sudo iptables -t mangle -A POSTROUTING -o enp1s0f0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
```
```
pim@nginx0-chlzn0:~$ cat /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ...
  tunnels:
    chbtl2:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::e
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    chlzn1:
      mode: ip6gre
      mtu: 1544
      local: 2001:678:d78:f::2:0
      remote: 2001:678:d78::10
      addresses: [ 194.1.163.31/32, 2001:678:d78::1:0:1/128, 194.126.235.31/32, 2a0b:dd80::1:0:1/128 ]
    ...
```
With this, I know that traffic arriving on chbtl2 was forwarded by the chbtl2 maglev frontend.
That mapping is the key that makes per-source attribution possible, which is a property that I can
exploit in an nginx plugin. Clever!
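Even before any nginx machinery gets involved, that property is easy to see by hand: a tcpdump scoped to one tunnel interface only ever shows the flows that this particular maglev frontend delivered. A quick sketch, using the interface name configured above:

```
pim@nginx0-chlzn0:~$ sudo tcpdump -ni chbtl2 -c 5 'tcp port 443'
```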
## NGINX: Stats Plugin
I wrote a tiny nginx module, [nginx-ipng-stats-plugin], that counts requests per VIP, split by
which interface (in my case, which GRE tunnel) delivered them. It is packaged as a standard Debian
libnginx-mod-http-ipng-stats package on [deb.ipng.ch] and loads into stock upstream nginx without
recompiling nginx itself. The design document and user guide are in the [docs/] directory of the
repo.
A tcpdump of the TCP handshakes confirms that things look correct. Both sides settle on the standard internet MSS values despite the larger internal MTU:
```
pim@nginx0-chlzn0:~$ sudo tcpdump -i any -n '(tcp[tcpflags] & tcp-syn) != 0 or (ip6 and ip6[6]=6 and (ip6[13+40] & 2) != 0)'
05:13:46.547826 chbtl2 In IP 162.19.252.246.39246 > 194.1.163.31.443: Flags [S], seq 2576867891,
  win 64240, options [mss 1460,sackOK,TS val 767241235 ecr 0,nop,wscale 7], length 0
05:13:46.547860 enp1s0f0 Out IP 194.1.163.31.443 > 162.19.252.246.39246: Flags [S.], seq 3931956759,
  ack 2576867892, win 65142, options [mss 1460,sackOK,TS val 681127624 ecr 767241235,nop,wscale 7], length 0
05:13:46.584236 chlzn1 In IP6 2a03:2880:f812:5e::.28858 > 2a0b:dd80::1:0:1.443: Flags [S], seq 36254033,
  win 65535, options [mss 1380,sackOK,TS val 3022307959 ecr 0,nop,wscale 8], length 0
05:13:46.584300 enp1s0f0 Out IP6 2a0b:dd80::1:0:1.443 > 2a03:2880:f812:5e::.28858: Flags [S.], seq 3586034356,
  ack 36254034, win 64482, options [mss 1440,sackOK,TS val 1977557327 ecr 3022307959,nop,wscale 7], length 0
```
The thing to look for here is that the IPv4 client from OVH was told, in the SYN-ACK going out on
enp1s0f0, that we are happy to take an MSS of 1460 for this IPv4 TCP connection, despite the larger
MTU on that interface. Similarly, the IPv6 client from Meta was told we'll accept an MSS of 1440.
That's MSS clamping at work!
### A curious case of SO_BINDTODEVICE versus IP_PKTINFO
Attributing a connection to its ingress interface sounds simple: bind each listening socket to a
specific interface with SO_BINDTODEVICE. Traffic arriving on chbtl2 goes to that socket;
traffic on nlams1 goes to another socket; attribution falls out of which socket accepted the
connection. This approach was also my first failure :-)
The problem is that SO_BINDTODEVICE affects both ingress and egress. A socket bound to
the chbtl2 device will try to route its return traffic back through that GRE tunnel, back to the
Maglev machine. That completely breaks direct server return, which requires responses to leave via
the default gateway (the internet-facing NIC). SO_BINDTODEVICE on a GRE interface and DSR are
mutually exclusive. Yikes.
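For reference, this is roughly what the rejected approach looks like; a sketch, not the module's
actual code. Binding the listener like this pins the reply path to the tunnel device as well, which
is exactly what breaks DSR:

```c
/* Rejected approach: SO_BINDTODEVICE ties the socket's egress to the device too,
 * so replies would be routed back into the GRE tunnel instead of out enp1s0f0. */
const char *dev = "chbtl2";
if (setsockopt(listen_fd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)) < 0)
    perror("setsockopt(SO_BINDTODEVICE)");
```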
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
But Linux has another trick up its kernelistic sleeve: IP_PKTINFO (for
[IPv4]) and IPV6_RECVPKTINFO (for
[IPv6]). These socket options tell the kernel
to attach a control message (cmsg) to each accepted connection containing the interface index on
which the packet arrived. The listening socket itself remains a wildcard bound to no specific
interface, so outgoing packets follow the normal routing table and leave via the default gateway.
Attribution comes from reading the cmsg at connection time, not from socket binding. Whoot!
In pseudocode, reading the ingress interface from the ancillary data looks like this:
```c
// Enable IP_PKTINFO / IPV6_RECVPKTINFO on the listening socket at init time.
int one = 1;
setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one));
setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &one, sizeof(one));

// At accept time, read the cmsg to find the ifindex. The control buffer must be
// large enough to hold the pktinfo structure of either address family.
char cbuf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
struct msghdr msg = { .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
int ifindex = -1;
recvmsg(fd, &msg, 0);
for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
    if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
        struct in_pktinfo *pki = (struct in_pktinfo *)CMSG_DATA(cm);
        ifindex = pki->ipi_ifindex;   // <- the ingress interface
    }
    if (cm->cmsg_level == IPPROTO_IPV6 && cm->cmsg_type == IPV6_PKTINFO) {
        struct in6_pktinfo *pki = (struct in6_pktinfo *)CMSG_DATA(cm);
        ifindex = pki->ipi6_ifindex;  // <- the ingress interface
    }
}
```
The module enables both socket options on every nginx listening socket and reads the ifindex from
the cmsg on each accepted connection. It then looks the ifindex up in a table built at
configuration time from the device=<ifname> parameters on the listen directives. The match
produces a short attribution tag. Connections that arrive on an interface with no registered
binding fall back to the configurable default tag (called direct), which handles unattributed
traffic like direct HTTPS connections that bypass Maglev entirely.
### NGINX Plugin
Three things are needed in nginx.conf: a shared memory zone for counters, device-bound listen
directives that map each GRE interface to a source tag, and a scrape location:
```nginx
http {
  ipng_stats_zone ipng:4m;

  server {
    # One device-tagged listen per GRE interface per address family.
    # Each 'listen' tells the module which ifname maps to which tag.
    listen 80        device=chbtl2 ipng_source_tag=chbtl2;
    listen [::]:80   device=chbtl2 ipng_source_tag=chbtl2;
    listen 443       device=chbtl2 ipng_source_tag=chbtl2 ssl;
    listen [::]:443  device=chbtl2 ipng_source_tag=chbtl2 ssl;

    listen 80        device=nlams1 ipng_source_tag=nlams1;
    listen [::]:80   device=nlams1 ipng_source_tag=nlams1;
    listen 443       device=nlams1 ipng_source_tag=nlams1 ssl;
    listen [::]:443  device=nlams1 ipng_source_tag=nlams1 ssl;

    # ... repeat for frggh1, chlzn1 ...
  }

  # Scrape endpoint on a management port.
  server {
    listen 127.0.0.1:9113;
    location = /.well-known/ipng/statsz {
      ipng_stats;
      allow 127.0.0.1;
      deny all;
    }
  }
}
```
The cool trick is that multiple device= listens on the same port are not multiple kernel sockets;
under this IP_PKTINFO model they collapse to a single wildcard socket, and the module distinguishes
traffic by reading the ifindex from the cmsg. Adding a new VIP is a server_name change in nginx;
adding a new maglev frontend is a two-line append to the listens file. Neither requires a restart.
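One way to convince yourself of that single-socket behaviour (a sketch; the exact output depends on
the system) is to list the listeners: despite four device= stanzas for port 443, there is just one
wildcard socket per address family:

```
pim@nginx0-chlzn0:~$ sudo ss -ltnp 'sport = :443'
```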
I kept the hot path intentionally minimal. On each request's log phase, the nginx worker increments
a counter in its own private table. There are no locks, no atomics, nothing fancy: just an integer
increment into memory that only this worker ever writes. A periodic timer (default one second) then
flushes the worker's private deltas into the shared memory zone using atomic adds. The scrape
handler called ipng_stats reads only from shared memory.
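As a sketch of that pattern (illustrative names, not the module's actual code), assuming a
fixed-size counter table:

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_KEYS 4096  /* illustrative: one slot per (source_tag, vip, code) combination */

/* Hot path: per-worker private counters, plain increments, no locks or atomics. */
static uint64_t local_requests[MAX_KEYS];

static void on_log_phase(int key) {
    local_requests[key]++;
}

/* Periodic timer (default 1s): flush the worker's deltas into shared memory with atomic adds. */
static void flush_to_shm(_Atomic uint64_t *shm_requests) {
    for (int k = 0; k < MAX_KEYS; k++) {
        uint64_t delta = local_requests[k];
        if (delta) {
            atomic_fetch_add(&shm_requests[k], delta);
            local_requests[k] = 0;
        }
    }
}
```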
The module also registers an nginx variable $ipng_source_tag that resolves to the attribution
tag for the current connection. That variable is available in log_format, map, add_header,
and any other directive that accepts nginx variables, which is how the logtail pipeline gets the
attribution.
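For example, a hypothetical debug header (the header name is mine, not something the module
defines) makes the attribution visible from curl or a browser:

```nginx
# Hypothetical: expose the attribution tag of the current connection as a response header.
add_header X-Debug-Source-Tag $ipng_source_tag always;
```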
Scraping the endpoint confirms attribution is working. With Accept: text/plain, the output is
Prometheus text format; with Accept: application/json, it is JSON. Both support source_tag=
and vip= query parameters to filter the output:
```
pim@nginx0-chlzn0:~$ curl -Ss http://localhost:9113/.well-known/ipng/statsz?source_tag=chbtl2
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="4xx"} 100062
nginx_ipng_requests_total{source_tag="chbtl2",vip="194.1.163.31",code="2xx"} 14621209
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="4xx"} 6340
nginx_ipng_requests_total{source_tag="chbtl2",vip="2001:678:d78::1:0:1",code="2xx"} 10339863
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="194.1.163.31"} 1599408141.000
nginx_ipng_bytes_in_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 2405616085.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="194.1.163.31"} 418826340291.000
nginx_ipng_bytes_out_sum{source_tag="chbtl2",vip="2001:678:d78::1:0:1"} 47520361606.000
# ...
```
The code label is bucketed into six classes (1xx..5xx and unknown), which keeps
per-VIP cardinality bounded regardless of how many distinct HTTP status codes appear in the wild
(and trust me, they all occur in the wild). The module also exports request duration histograms
and upstream response time histograms, but the request/byte counters above are the day-to-day
workhorses.
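To give an idea of the kind of query this enables (metric and label names as in the scrape output
above), a per-frontend 5xx ratio is a one-liner in PromQL:

```
sum by (source_tag) (rate(nginx_ipng_requests_total{code="5xx"}[5m]))
  /
sum by (source_tag) (rate(nginx_ipng_requests_total[5m]))
```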
## NGINX: Log Hook
Since I'm looking at all requests in the nginx log phase anyway, I thought perhaps I could go one step further. Prometheus counters answer "how much traffic?" but not "from whom?" or "to which URI?". Adding per-client-IP or per-URI dimensions to Prometheus would be a catastrophic idea during a DDoS: a modest attack with one million source IPs would create one million Prometheus time series and make the monitoring system the first casualty of the incident. The C module is deliberately narrow in scope.
Instead, every request emits a structured log line that carries all the high-cardinality dimensions. The format used by the stats plugin's logtail integration is:
```nginx
log_format ipng_stats_logtail
    'v1\t$host\t$remote_addr\t$request_method\t$request_uri\t$status'
    '\t$body_bytes_sent\t$request_time\t$is_tor\t$asn\t$ipng_source_tag'
    '\t$server_addr\t$scheme';
```
The v1\t prefix is a version tag. When the format needs to evolve, a new version (say v2)
can be added while old emitters are still running; a reader can route each packet to the
appropriate parser by looking at the version. In case you're curious: variables like $is_tor come
from a map of Tor exit nodes that I maintain, and $asn comes from MaxMind GeoIP. Check it out
[here].
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
Adding an access_log directive in every server or location block would be error-prone and
would miss any newly added vhost. It would also cause extra disk activity, writing both the logtail
lines and the regular access_log. I decide that my stats plugin will provide an
ipng_stats_logtail directive at the http level that fires globally for every request, regardless
of which server or location handled it. Because why not?
To exclude noisy requests like health probes from the logtail stream, I add an if= parameter
(analogous to the one on access_log) which evaluates an nginx variable at log phase and suppresses
emission when the value is empty or 0. A map block is the idiomatic way to build that variable:
```nginx
map $request_uri $logtail_skip_uri {
  ~^/\.well-known/ipng  1;
  default               0;
}
map $host $logtail_skip_host {
  maglev.ipng.ch  1;
  default         0;
}
map "$logtail_skip_uri:$logtail_skip_host" $logtail_enabled {
  "0:0"    1;
  default  0;
}

ipng_stats_logtail udp://127.0.0.1:9514 buffer=64k flush=1s if=$logtail_enabled;
```
It took me a while to get used to constructions like this. The first map matches a regular
expression against $request_uri and yields a string (0 or 1); the second does the same for $host
(exact hostnames are an O(1) hash lookup, regexes are also allowed). Then, some funky boolean math
allows me to concatenate these two strings in a new map, which can yield "0:0" (neither 'skip'
matched), "1:0" (the URI skip matched but the host skip did not), "0:1" (the URI did not match but
the host did), or "1:1" (both skips matched), once again mapping those to a string (0 or 1), which
the ipng_stats_logtail directive can use in its if= argument.
The log lines themselves are buffered in a per-worker memory buffer (64 KB by default) and transmitted as a
single UDP datagram on flush. If no receiver is listening on 127.0.0.1:9514, the kernel will silently
discard the datagram. No blocking, no error, no disk I/O. This fire-and-forget design is just great:
analytics should never slow down a request. A lost log line is acceptable; a slow request is not.
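During bring-up it is handy to eyeball the stream before any collector exists; any UDP listener
will do, for example (assuming nothing else is bound to the port yet):

```
pim@nginx0-chlzn0:~$ socat -u UDP4-RECV:9514 STDOUT
```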
## NGINX: Logtail
But who or what consumes these UDP packets? Enter
[nginx-logtail], a four-binary Go pipeline that ingests
those log lines and answers "which client prefix is being served 429s right now?" or "which ASN is
sending me the most requests to /ct/api in the last 6 hours?" I'll just come right out and admit it:
this little program is 100% written and maintained by Claude Code. I had no hesitation deploying it,
but I did review every bit of the code before it went into production. The design document is in
[docs/design.md].
The four components are:

- collector runs on each nginx host. It receives UDP datagrams from the stats plugin and maintains
  in-memory ranked top-K counters across six time windows (1m, 5m, 15m, 60m, 6h, 24h). It exposes a
  gRPC endpoint and rolls up its log counters into a Prometheus /metrics endpoint.
- aggregator runs on a central host. It subscribes to all collectors' snapshots via streaming gRPC
  and serves a merged view using the same gRPC interface.
- CLI (nginx-logtail) allows one-off queries from the shell, against any collector or the
  aggregator, and can output JSON or text.
- frontend is an HTTP dashboard with drilldown tables and SVG sparklines; server-rendered HTML,
  zero JavaScript, because again, why not?
The data model is a 7-tuple: (website, client_prefix, uri, status, is_tor, asn, source_tag),
mapped to a 64-bit request count. Client IPs are truncated to /24 (IPv4) or /48 (IPv6) prefixes
at ingest, which keeps cardinality bounded even during DDoS events with millions of source IPs. The
source_tag dimension is the attribution tag from $ipng_source_tag, which is how the logtail
data can be filtered by maglev frontend. Isn't that cool?!
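As a sketch of what that data model amounts to (field names are illustrative, not the actual Go or
protobuf definitions):

```go
// Key is the 7-tuple every request is bucketed under; the value is a 64-bit count.
type Key struct {
	Website      string // $host
	ClientPrefix string // client IP truncated to a /24 (IPv4) or /48 (IPv6) prefix
	URI          string // $request_uri
	Status       uint16 // HTTP status code
	IsTor        bool   // from the Tor exit node map
	ASN          uint32 // from GeoIP
	SourceTag    string // $ipng_source_tag, i.e. which maglev frontend delivered it
}

// Counters is one time window's worth of per-key request counts.
type Counters map[Key]uint64
```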
### Backfill on Aggregator Restart
While developing the logtail, I noticed that restarting the aggregator would (obviously) mean losing
24 hours of historical data. To avoid this, the aggregator calls DumpSnapshots on each collector
at startup. Each collector streams its entire fine (1-minute) and coarse (5-minute+) ring buffer
contents back to the aggregator, which merges them into its own rings. The backfill is concurrent
across all collectors and happens before the aggregator's HTTP endpoint starts serving. From a user
perspective, an aggregator restart is invisible: the dashboard shows historical data from the full
retention window immediately, all at the expense of a few gigs of network traffic on IPng's
backbone.
### CLI Examples
The CLI is a quick tool for operational triage:
```bash
# Top 20 client prefixes by request count in the last 5 minutes.
nginx-logtail topn --target agg:9091 --window 5m --group-by prefix --n 20

# Which client prefixes are receiving the most 429s right now?
nginx-logtail topn --target agg:9091 --window 1m --group-by prefix --status 429 --n 20

# Is traffic from a specific maglev frontend distributed normally across websites?
nginx-logtail topn --target agg:9091 --window 5m --group-by website --source-tag chbtl2

# Which URIs are generating the most 5xx responses in the last hour?
nginx-logtail topn --target agg:9091 --window 60m --group-by uri --status '>=500'

# Show a time-series trend for errors on one website.
nginx-logtail trend --target agg:9091 --window 5m --website ipng.ch --status '>=400'
```
The --status flag accepts expressions like 429, >=400, !=200, or <500. The
--website-re and --uri-re flags accept RE2 regex patterns. --json emits NDJSON for
downstream processing with jq.
### Frontend
But who needs CLIs when you can also ask Claude to make web-frontends? The nginx-logtail frontend is a server-rendered dashboard with no JavaScript. It uses the gRPC endpoints on collector and aggregator to render top-K tables with inline SVG sparklines showing the request count trend per time bucket over the last 24 hours, with drill down and filtering based on the 7-tuple above.
{{< image width="100%" src="/assets/vpp-maglev/logtail-frontend.png" alt="nginx-logtail web frontend" >}}
Clicking any row in the table adds it as a filter and advances to the next dimension in the
hierarchy: website, client prefix, request URI, HTTP status, ASN, source tag. A breadcrumb strip
above the table shows all active filters; clicking the x on any token removes just that filter.
A filter expression box accepts direct text input for filters like status!=200 AND
website~=mon.ct.ipng.ch, as seen above. The URL encodes the full query state, so any view can be
bookmarked or shared. Requests are quick, averaging around 150ms. It has proven very useful for
finding out who is using which webservice, from where.
## Prometheus
Three Prometheus sources cover the system from different angles. They are designed to be used together; each answers questions the others cannot.
Source 1: vpp-maglev is the health and controlplane view. It exports backend states (up / down / unknown / paused / disabled), effective weights per pool, and VPP API call outcomes. This is the authoritative source for "which backends are healthy right now" and "what weight is VPP actually using for each application server." Dashboards built here answer: is the system healthy? The [vpp-maglev docs] describe the full metric surface.
Source 2: nginx-ipng-stats-plugin is the traffic volume view. It exports per-(source_tag, vip) request and byte counters from inside nginx. The key metrics are
nginx_ipng_requests_total and nginx_ipng_bytes_out_total, both labeled source_tag and
vip, with a code label for status class. This layer is deliberately terse and scoped: no
per-client, no per-URI dimensions. Dashboards built here answer: which maglev frontend is
sending how much traffic to which VIP?
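In PromQL, that question (again assuming the metric names above) is simply:

```
sum by (source_tag, vip) (rate(nginx_ipng_requests_total[5m]))
```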
Source 3: nginx-logtail (collector) is the high-cardinality view. The collector's Prometheus
endpoint exports per-host request counters, body-size histograms, and request-time histograms,
plus per-source_tag rollup counters. The gRPC top-K service answers the "who and what" questions
that Prometheus alone cannot, without the cardinality risk.
The three sources complement each other for cross-layer diagnostics:
- If vpp-maglev shows all backends up but nginx_ipng_requests_total is zero for a specific
  source_tag, the maglev frontend stopped forwarding. The BGP announcement may have been withdrawn,
  or the GRE tunnel is down.
- If nginx_ipng_requests_total is healthy for a VIP but vpp-maglev shows a backend in down state,
  the pool failover is working: traffic has moved to the standby pool, and the primary pool is
  being drained.
- If vpp-maglev shows a backend as up and the stats plugin shows traffic, but error rates in
  nginx-logtail are climbing, the application itself is struggling, not the load balancer.
{{< image width="100%" src="/assets/vpp-maglev/grafana-dashboard.png" alt="Grafana dashboard" >}}
The Grafana dashboard combines all three sources. The top panel shows per-maglev-frontend request
rates from nginx_ipng_requests_total, so I can see at a glance which of the Maglev frontends is
busiest and whether the distribution between them looks right. Backend health state from
vpp-maglev is overlaid as annotations: a backend going down appears as a vertical band on the
traffic panel at the exact moment the traffic redistributed.
Good observability consists of metrics and analytics as well as alerting signals. Two alerting rules I find particularly useful:
```yaml
groups:
  - name: maglev
    rules:
      - alert: NoTrafficFromMaglevFrontend
        expr: |
          sum by (source_tag) (
            rate(nginx_ipng_requests_total{source_tag!="direct"}[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "Maglev frontend {{ $labels.source_tag }} is sourcing de-minimis traffic"
          description: "Check anycast announcements and GRE tunnel state for {{ $labels.source_tag }}"

      - alert: NoTrafficToVIP
        expr: |
          sum by (vip) (
            rate(nginx_ipng_requests_total[10m])
          ) < 1
        for: 10m
        annotations:
          summary: "VIP {{ $labels.vip }} is receiving de-minimis traffic from any source"
          description: "Check anycast announcements; no maglev frontend is forwarding to this VIP"
```
NoTrafficFromMaglevFrontend fires if a specific Maglev frontend goes silent for ten minutes, where
silent here means less than 1.0 qps of traffic coming from it. This is distinct from a backend
going down: it means the maglev machine itself has stopped forwarding, which is a network event
(remember, it's always BGP!) rather than an application event.
NoTrafficToVIP fires if a VIP receives no traffic from any Maglev frontend. This would be pretty
bad, as l.ipng.ch is advertising the VIP via A/AAAA records (remember, it's always DNS!), so if
the VIP is not receiving any traffic from any Maglev source at all, that would be a fairly serious
situation that warrants a page.
## Results
The [nginx-logtail] service has been running for about three months now. Originally it tailed a
literal log file for the [Static CT Logs] as they were being scraped by some abusive user. Using
logfiles had the nice benefit of not needing any changes to nginx at all, just a bunch of repeated
access_log statements referring to the custom log_format.
Then I added the [nginx-ipng-stats-plugin], which has been running in production for about four weeks now and gives a lot of very useful stats. The [vpp-maglev] project is in pretty good shape and has been running for about the same time (a month or so) as well.
On May 1st, we celebrated Labor Day here in Switzerland. It seemed like as good a day as any to
move all web properties at IPng Networks over to the l.ipng.ch VIPs, funneling all traffic through
redundantly announced Maglev instances into redundantly connected nginx frontends. For about two
weeks now, things have been completely fine - knock on wood - so it's safe to say IPng ❤️ Maglev.
## What's Next
The VPP routers in AS8298, and also the nginx frontends, all perform sFlow sampling, using the
[[sFlow]({{< ref 2025-02-08-sflow-3 >}})] implementation I worked on last year. I'm pretty confident
that, given the sFlow packet data, and near real time logtail request data, I should be able to
detect abuse, DDoS, and other failure scenarios. I think my next project will be to create some form
of nginx plugin that allows me to rate limit or drop abusive client IP addresses programmatically,
based on these signals. Similarly, being able to feed (very) abusive IP prefixes into BGP Flowspec
and having them simply dropped at the VPP Maglev frontend (rather than forwarded to the nginx
frontends), sounds like another fun thing to toy with.
But for now, I'm content with the progress in IPng's web serving infrastructure.