---
date: "2026-05-08T06:35:14Z"
title: VPP with Maglev Loadbalancing - Part 2
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

In a [[previous article]({{< ref "2026-04-30-vpp-maglev" >}})], I looked into the Maglev algorithm and
how it is implemented in VPP. I fixed a couple of bugs in the API and added features to set weights
for application server backends. In this article, I am going to describe an approach to a control
plane for VPP's Maglev plugin.
## Introduction

For the VPP Maglev plugin to be truly useful, some automation has to govern its use of backends:
which ones get how much traffic, which ones are unhealthy and need to be drained, and so on.
Ideally, this control loop is fully automatic: when backends go missing, either because they are
down themselves or because the datacenter they are in decides to take the day off, it would be nice
if the load balancer notices this and avoids sending traffic there. However, the VPP Maglev plugin
does not offer any of these smarts. The plugin is a pure dataplane component that can sling packets
to backends at very high rates, and all the rest is left as an exercise for the reader.
## VPP Maglev: Controlplane

The core problem is that VPP's `lb` plugin is pure dataplane. It holds a table of VIPs, each with a
set of application servers and their weights using the feature I added. It then hashes new flows
deterministically onto those servers. That is cool, but it is all the plugin does. If a backend stops
responding, VPP does not know and does not care - it will keep sending traffic to that address until
someone or something tells it otherwise. The result is a black hole: clients trying to establish new
connections time out while waiting for a backend that will never respond.

Before I decided to write `vpp-maglev`, the fix for missing or down backends was manual: watch your
monitoring dashboards, notice when a backend is down, SSH into the machine running VPP, and use
`vppctl lb as ... del flush` to remove the dead backend. That works, but it obviously requires a
human in the loop and introduces a window of failure between the backend going down and the operator
reacting. For a production load balancer that is supposed to be invisible to users, this is not good
enough.

What IPng needs, at a high level, is a controlplane that can:
1. Continuously probe each backend and maintain an accurate view of its health.
1. Translate health state changes into VPP API calls immediately, without human intervention.
1. Handle edge cases gracefully: what happens when `maglevd` itself restarts? When VPP restarts?
   When a backend is briefly playing _Flappy Bird_?
1. Expose all of this state through a uniform API so that CLIs, dashboards, and monitoring scripts
   can all read from (and write to) the same source of truth.

To address my needs, I decided to write **vpp-maglev**, which ships as four binaries: `maglevd` (the
controlplane daemon), `maglevc` (a CLI for it), `maglevd-frontend` (a web dashboard for it), and
`maglevt` (an out-of-band test utility). The rest of this article goes through each one in detail.

## Design Principles

Before blindly writing code, I wrote down a few of the constraints I wanted to hold true. Wait, a
design you say? Well, yes! And this design turned out to drive most of the architectural decisions:

**One source of truth.** Every component - CLI, web dashboard, alerting scripts - reads `maglevd`
through one typed gRPC interface. There is no secondary control plane. The CLI and the web dashboard
show exactly the same state as each other because they both ask the same controlplane daemon.

**Restart neutrality.** Restarting `maglevd` while VPP is serving live traffic must not cause user
interruption or traffic blackholing. A naive implementation would initialize an empty LB state upon
startup, because at that point the `vpp-maglev` daemon sees every backend in an initial `unknown`
state. I need to make sure I design for things like controlplane upgrades from the get-go, so that
they are safe.

**Diff-based reconciliation.** I want to create a VPP sync that computes a desired state from the
config and current observed health, then diffs it against what VPP already has, issuing only the
minimum set of API calls to converge. This is not too dissimilar from the approach I took in
[[vppcfg]({{< ref 2022-03-27-vppcfg-1 >}})], in that running the sync multiple times needs to
produce the same outcome as running it once.

**Structured observability from the start.** Every state change needs to be accounted for in a
structured JSON log, a Prometheus counter increment, and a streaming gRPC event. All three, every
time. I find it very frustrating to debug production systems that have ad hoc log messages and no
metrics, and if there is one thing a lifetime career as an SRE has taught me, it is to set the
observability bar high early.
## Health Checker: `maglevd`

`maglevd` is the long-running daemon at the center of everything. Its initial configuration needs to
be present on the machine, so that cold restarts do not need to phone home to get a running config.
My first decision is to let it read a YAML configuration file that describes three named
collections: `healthchecks`, `backends` that reference the health checks, and `frontends` that
reference the backends.

The configuration structure maps directly onto the internal runtime model, sort of like this:
```yaml
maglev:
  healthchecks:
    http-check:
      type: http
      port: 80
      params:
        path: /.well-known/ipng/healthz
        response-code: "200-204"
      interval: 5s

  backends:
    nginx0-ams:
      address: 192.0.2.10
      healthcheck: http-check
    nginx1-ams:
      address: 192.0.2.11
      healthcheck: http-check
    nginx0-fra:
      address: 192.0.2.12
      healthcheck: http-check

  frontends:
    http-vip:
      address: 192.0.2.1
      protocol: tcp
      port: 80
      pools:
        - name: primary
          backends:
            nginx0-ams: { weight: 100 }
            nginx1-ams: { weight: 10 }
        - name: fallback
          backends:
            nginx0-fra: {}
```
A **healthcheck** defines how to probe - the protocol, port, success criteria, timing parameters,
and so on. A **backend** is a named IP address bound to a healthcheck. A **frontend** is a VIP
address with one or more named **pools**, where each pool is an ordered list of `(backend, weight)`
tuples. At runtime, each backend gets exactly one prober (a goroutine, in Go), regardless of how
many frontends reference it, which greatly cuts down on probe traffic.

Probes run on the configured schedule and their results flow through a state machine. State
changes emit events that the reconciler picks up and translates into VPP API calls and gRPC
streaming events for subscribed clients. The frontend's aggregate state, be it `up`, `down`, or
`unknown`, is derived from the effective weights of its backends and needs to be updated on every
backend transition.

The Golang `slog` (structured log) package emits machine-consumable JSON directly:
```json
{"level":"INFO","msg":"backend-transition","backend":"nginx0-ams","from":"down","to":"up","code":"L7OK","detail":""}
{"level":"INFO","msg":"frontend-transition","frontend":"http-vip","from":"down","to":"up"}
```

I don't really have to think about all of this state checking stuff from scratch. There are a few
really good loadbalancers out there already! One of them is HAProxy, which I used a very long time
ago. It features a really good health checking approach, the principles of which I am grateful to
borrow for my own project.

### HAProxy: Learning from its Health Counter
The state machine is driven by a single integer borrowed from HAProxy's health model: given a `rise`
threshold and a `fall` threshold, define a counter `health` in the range `[0, rise + fall - 1]`. The
backend is considered `up` when `health >= rise` and `down` when `health < rise`.

On each probe, a pass increments the counter (ceiling at maximum); a failure decrements it (floor
at zero). This gives **hysteresis**: a fully healthy backend (counter at its maximum) needs `fall`
consecutive failures before it transitions to down, and a fully-down backend needs `rise`
consecutive passes to come back up. A flapping backend that alternates between passing and failing
stays in the degraded zone without bouncing between states - which is exactly what I want, to
avoid a storm of VPP API calls from a noisy backend.

In _pseudocode_, here's what that simple yet elegant approach looks like:
```go
type HealthCounter struct {
	Health int
	Rise   int
	Fall   int
}

func (h *HealthCounter) Max() int   { return h.Rise + h.Fall - 1 }
func (h *HealthCounter) IsUp() bool { return h.Health >= h.Rise }

func (h *HealthCounter) RecordPass() bool {
	wasUp := h.IsUp()
	if h.Health < h.Max() {
		h.Health++
	}
	return !wasUp && h.IsUp() // true only on a down -> up transition
}

func (h *HealthCounter) RecordFail() bool {
	wasDown := !h.IsUp()
	if h.Health > 0 {
		h.Health--
	}
	return !wasDown && !h.IsUp() // true only on an up -> down transition
}
```
Taking an example of `rise=2, fall=3`, the health counter will span `[0, 4]`. The state boundary
sits between the 'down' side (health of 0 or 1) and the 'up' side (health of 2, 3 or 4). A fully
healthy backend sitting at health counter 4 will need three consecutive failures to go down:
4 -> 3 -> 2 -> 1.
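
To see the counter in action, here is a tiny driver for the pseudocode above (assuming the
`HealthCounter` type lives in the same package); it is purely illustrative and not part of
`maglevd`:

```go
package main

import "fmt"

func main() {
	// rise=2, fall=3: the counter spans [0, 4] and 'up' means health >= 2.
	hc := &HealthCounter{Health: 4, Rise: 2, Fall: 3}

	// Walk a fully healthy backend down. The up -> down transition fires on
	// the third failure, when health drops below rise (2 -> 1).
	for i := 1; i <= 4; i++ {
		transitioned := hc.RecordFail()
		fmt.Printf("fail %d: health=%d up=%v transitioned=%v\n", i, hc.Health, hc.IsUp(), transitioned)
	}

	// Bring it back: rise=2 consecutive passes, the down -> up transition
	// fires on the second pass (1 -> 2).
	for i := 1; i <= 2; i++ {
		transitioned := hc.RecordPass()
		fmt.Printf("pass %d: health=%d up=%v transitioned=%v\n", i, hc.Health, hc.IsUp(), transitioned)
	}
}
```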

When a backend enters the `unknown` state, for example when the `vpp-maglev` daemon just started, or
after a backend was briefly paused or disabled, I try to be a bit more clever than HAProxy (famous
last words, I'm sure) by pre-setting the health counter to `rise - 1`. This means the very first
probe resolves the state immediately: one pass produces an _unknown_ transition to _up_, and one
fail produces an _unknown_ transition to _down_. The shortcut allows any probe failure while the
state is `unknown` to immediately mark the backend down. I argue that a backend that cannot pass
even its very first probe should not receive traffic, and we should not wait for its health to fall
all the way down to 0.
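
A minimal sketch of that pre-seeding, with a constructor name that is mine rather than the
project's:

```go
// NewHealthCounter starts a backend in the 'unknown' zone at rise-1: the very
// first passing probe reaches 'rise' (up), and the very first failing probe
// drops it further into the 'down' zone.
func NewHealthCounter(rise, fall int) *HealthCounter {
	return &HealthCounter{Health: rise - 1, Rise: rise, Fall: fall}
}
```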

**Probe types.** `maglevd` starts off its life supporting four probe types:

- **`icmp`** - sends an ICMP echo request and waits for a reply, for which I do not need to run the
  daemon with root privileges; instead I can assign `CAP_NET_RAW` for this purpose. This healthcheck
  type is useful for checking basic reachability without opening a TCP connection. Borrowing again
  from HAProxy, this can result in probe codes: `L4OK` on reply, `L4TOUT` on timeout, `L4CON` on send
  error.
- **`tcp`** - opens a TCP connection to the configured port and closes it cleanly. This healthcheck
  can optionally wrap the connection in TLS with parameter `ssl: true`, with optional server name and
  `insecure-skip-verify` to allow for self-signed certificates. The resulting probe codes are `L4OK`
  on connect, `L4CON` on refused, `L4TOUT` on timeout, and `L6OK`/`L6CON`/`L6TOUT` for TLS.
- **`http`** - opens a TCP connection, sends an HTTP/1.1 `GET` request to the configured path with
  an optional `Host` header, and validates the response code against a configured range (e.g.
  `"200-204"`). This healthcheck can optionally validate the body against a regular expression,
  making it similar to how Nagios does its checks. The probe return codes are `L7OK` on success,
  `L7STS` on unexpected status code, `L7RSP` on bad response, and `L7TOUT` on timeout (a sketch of
  this mapping follows after this list).
- **`https`** - a special case of the `http` healthcheck type, but using TLS. It supports the SNI
  `server-name` override and `insecure-skip-verify` as well, for backends with self-signed
  certificates.

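
To make the probe-code mapping concrete, here is a rough sketch of an `http` probe. The result
codes mirror the list above, but the function, types, and error handling are illustrative and not
the actual `maglevd` implementation:

```go
package probe

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Result mirrors the HAProxy-style probe codes used above (L7OK, L7STS, ...).
type Result struct {
	OK     bool
	Code   string
	Detail string
}

// ProbeHTTP fetches the configured path and checks the status code range.
func ProbeHTTP(addr string, port int, path string, minCode, maxCode int, timeout time.Duration) Result {
	client := &http.Client{Timeout: timeout}
	url := fmt.Sprintf("http://%s%s", net.JoinHostPort(addr, fmt.Sprint(port)), path)

	resp, err := client.Get(url)
	if err != nil {
		// Coarse split: timeouts become L7TOUT, everything else a bad response.
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			return Result{Code: "L7TOUT", Detail: err.Error()}
		}
		return Result{Code: "L7RSP", Detail: err.Error()}
	}
	defer resp.Body.Close()

	if resp.StatusCode < minCode || resp.StatusCode > maxCode {
		return Result{Code: "L7STS", Detail: resp.Status}
	}
	return Result{OK: true, Code: "L7OK"}
}
```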

One other thing I noticed while reading the HAProxy docs is that its probe timing is not fixed,
but instead depends on the counter state. A fully healthy backend (counter at maximum) is probed at
the configured `interval`. A degraded or unknown backend is probed at the faster `fast-interval`, to
be able to mark it either up or down more quickly. And a fully down backend is probed at the slower
`down-interval`. The result is that a recovering backend is re-evaluated quickly, while one that has
been offline for a long time generates less probe traffic.

I add one additional detail (which I've learned the hard way when operating very large loadbalancer
pools with thousands of backends), namely jitter: every computed interval (fast, down or normal)
is scaled by a uniformly random factor of ±10%, so that all probe goroutines do not phase-lock to
the same wall-clock tick after a restart, and do not hit the backend at exactly the same time either.
Good for `vpp-maglev` and good for the backends. We can all win, sometimes :)
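
In Go, that schedule selection plus jitter can be sketched roughly as follows, building on the
`HealthCounter` from earlier (the parameter names are mine, illustrative only):

```go
package checker

import (
	"math/rand"
	"time"
)

// nextInterval picks the base probe interval from the counter state and
// applies +/-10% of jitter so probers do not phase-lock after a restart.
func nextInterval(h *HealthCounter, interval, fastInterval, downInterval time.Duration) time.Duration {
	var base time.Duration
	switch {
	case h.Health == h.Max(): // fully healthy: relaxed cadence
		base = interval
	case h.Health == 0: // long-term down: probe slowly
		base = downInterval
	default: // degraded or unknown: probe quickly to resolve the state
		base = fastInterval
	}
	factor := 0.9 + 0.2*rand.Float64() // uniform in [0.9, 1.1)
	return time.Duration(float64(base) * factor)
}
```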

**Pool failover.** I've found it can be useful, mostly in smaller deployments like IPng's mail and
webserver cluster, to have primary traffic stay local to the Maglev loadbalancer (eg. a VPP Maglev
instance in Amsterdam will select nginx backends in Amsterdam, not Paris or Zurich), but if they are
all down, to fall back to further-away backends in a different city.

This is how I came to the decision to give a frontend one or more pools, which act as priority
tiers. The idea is that the active pool is the first one that contains at least one backend in the
`up` state. Backends in inactive pools have their weight effectively forced to zero and will
therefore receive no traffic. If all backends in the primary pool were to be down, the weight of the
next-best pool needs to be re-evaluated, and when the backends in the primary pool recover, demotion
of the standby pool can be graceful thanks to the `lb as ... weight` feature I added to VPP:
existing flows to standby backends are left to drain naturally. Only an operator `disable` call will
trigger an immediate flow-table flush.
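
Here's a schematic of that pool-selection rule; the types are made up for illustration and are not
the actual `maglevd` data model:

```go
package pools

// Backend is a (name, configured weight, health) tuple inside a pool.
type Backend struct {
	Name   string
	Weight uint32
	Up     bool
}

// EffectiveWeights walks the pools in priority order: the first pool with at
// least one 'up' backend is active, and every backend outside the active pool
// (or inside it but down) gets an effective weight of zero.
func EffectiveWeights(pools [][]Backend) map[string]uint32 {
	out := make(map[string]uint32)
	activeFound := false
	for _, pool := range pools {
		hasUp := false
		for _, b := range pool {
			if b.Up {
				hasUp = true
				break
			}
		}
		active := !activeFound && hasUp
		for _, b := range pool {
			w := uint32(0)
			if active && b.Up {
				w = b.Weight
			}
			out[b.Name] = w
		}
		if active {
			activeFound = true
		}
	}
	return out
}
```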

## Controlplane API: gRPC Endpoint

I want all client-visible functionality to be exposed through a single gRPC service. Read-only
questions like 'how many frontends are there?' or 'what is the current health state of backend X?',
but also state-changing requests like 'set frontend F's backend B to weight W', need to be simple
RPCs.

The most powerful RPC I add is called `WatchEvents`. This one returns a streaming response, and a
client can initiate a `WatchRequest` which specifies which event types to include. The `vpp-maglev`
daemon then pushes events as they happen - there is no polling. The event envelope is a protobuf
`oneof`:
```protobuf
message Event {
  oneof event {
    LogEvent      log      = 1; // structured log record with key/value attrs
    BackendEvent  backend  = 2; // backend state transition
    FrontendEvent frontend = 3; // frontend aggregate state change
  }
}
```

Using this approach allows the maglev daemon to send useful information to downstream consumers like
a CLI or WebUI in a simple yet extensible way. I imagine a CLI command like `watch events`, or a web
dashboard that shows health checks and state transitions in realtime. Those will be super useful and
can be observed within milliseconds without any busy-waiting or polling.
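
As a sketch of what a consumer looks like: assuming the generated protobuf stubs are imported as
`pb` (the import path below is a placeholder) and leaving the `WatchRequest` filter fields at their
defaults, a Go client can follow the stream like this:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.invalid/vpp-maglev/proto" // placeholder import path for the generated stubs
)

func main() {
	conn, err := grpc.Dial("localhost:9090", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	stream, err := pb.NewMaglevClient(conn).WatchEvents(context.Background(), &pb.WatchRequest{})
	if err != nil {
		log.Fatal(err)
	}
	for {
		ev, err := stream.Recv() // blocks until maglevd pushes the next event
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %v", ev)
	}
}
```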

I didn't know this, but in the process of writing `vpp-maglev`, I learned about gRPC server
reflection, which I've enabled by default, so I can poke at the API without having the `.proto`
file, for example using `grpcurl` on the commandline:

```sh
pim@summer:~$ grpcurl -plaintext localhost:9090 list
pim@summer:~$ grpcurl -plaintext localhost:9090 maglev.Maglev/ListFrontends
pim@summer:~$ grpcurl -plaintext -d '{"name":"http-vip"}' localhost:9090 maglev.Maglev/GetFrontend
```
## Dataplane API: VPP Plugin Programming

There are two parts to programming the VPP dataplane state. First, a reconciler reacts to individual
backend state transitions, and then a VPP LB sync module computes the minimal set of API calls
needed to make the dataplane reflect the backend state as seen by the controlplane daemon.

I tried to keep the reconciler as simple as possible. It only subscribes to the healthchecker's
event channel and, for every backend transition, calls `SyncLBStateVIP` for the affected frontend.
To catch drift in the VPP dataplane, for example if VPP restarted or if we re-connected to VPP, a
periodic `SyncLBStateAll` also runs and sweeps up any changes. Drift should not occur in general
operation, though; it's a belt-and-suspenders type of thing.

This isolated `SyncLBState*` layer is also a future hook for divorcing the healthchecker and the LB
reconciler into two different binaries: think of a datacenter with 100 maglev frontends and 1000
local backends. In such a scenario, having three (N+2) healthcheckers should be sufficient - no need
to have every maglev check every backend!

Otherwise, the reconciler carries no state of its own. I put all the logic in `SyncLBStateVIP`,
which computes the full desired state from the config and current health, diffs it against what VPP
has, and issues only the necessary Binary API calls to bring the two in sync.
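
Schematically, and with a deliberately abstracted dataplane interface (the real code speaks VPP's
binary API, whose messages I am not reproducing here), the sync looks something like this:

```go
package lbsync

// Dataplane abstracts the handful of VPP lb operations the sync needs; the
// method names are placeholders, not actual VPP binary API calls.
type Dataplane interface {
	GetASWeights(vip string) (map[string]uint32, error)
	AddAS(vip, as string, weight uint32) error
	SetASWeight(vip, as string, weight uint32) error
	DelAS(vip, as string, flush bool) error
}

// SyncVIP diffs desired (backend address -> effective weight) against what the
// dataplane reports and issues only the calls needed to converge. Running it a
// second time right away should result in zero API calls.
func SyncVIP(dp Dataplane, vip string, desired map[string]uint32) error {
	actual, err := dp.GetASWeights(vip)
	if err != nil {
		return err
	}
	for as, want := range desired {
		have, exists := actual[as]
		switch {
		case !exists:
			if err := dp.AddAS(vip, as, want); err != nil {
				return err
			}
		case have != want:
			if err := dp.SetASWeight(vip, as, want); err != nil {
				return err
			}
		}
	}
	for as := range actual {
		if _, keep := desired[as]; !keep {
			if err := dp.DelAS(vip, as, false); err != nil {
				return err
			}
		}
	}
	return nil
}
```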

### Dataplane API: Startup Warmup

{{< image width="7em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

During one of my tests, I noticed that after restarting `maglevd`, it completely wipes the VPP
loadbalancer VIPs. In hindsight this makes total sense: when the healthchecker starts, all
backends are in the `unknown` state, which causes the weights to be zero until the backends
transition to the `up` state. This causes thrashing in the dataplane, which is not what I intended.
I think for a bit and decide how I'm going to prevent that. My solution is a two-phase startup
warmup controlled by `startup-min-delay` (default 5s) and `startup-max-delay` (default 30s):

**Phase 1: hands-off window.** For the first `startup-min-delay` seconds after `maglevd` starts,
neither the reconciler nor the periodic sync loop can touch VPP at all. Probes run, the checker
accumulates state, but transitions are suppressed at the dataplane. VPP continues serving whatever
it was programmed with before the restart.

**Phase 2: per-VIP release.** Between `startup-min-delay` and `startup-max-delay`, each VIP is
released as soon as every backend it references has reached a non-`unknown` state. A background
poll running every 250 milliseconds checks for releasable VIPs, and the reconciler also checks
on every received transition. Whichever wins the race performs a single `SyncLBStateVIP` for that
VIP, and from then on the VIP is free to live its life.

**Watchdog.** At `startup-max-delay`, any VIP whose backends are still `unknown` is swept by a
final `SyncLBStateAll`. Those stragglers are programmed with weight zero: something is still wrong
with them, but this is an unlikely situation, and one of those belt-and-suspenders things again.
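
A sketch of the release check, again with invented helper names:

```go
package warmup

import "time"

// vipReleasable reports whether a VIP may be synced to VPP during warmup:
// before startup-min-delay nothing is touched; after startup-max-delay the
// watchdog sweeps everything; in between, a VIP is released once none of its
// backends is still 'unknown'.
func vipReleasable(started time.Time, minDelay, maxDelay time.Duration, backendStates []string) bool {
	elapsed := time.Since(started)
	if elapsed < minDelay {
		return false // Phase 1: hands-off window
	}
	if elapsed >= maxDelay {
		return true // Watchdog: sweep stragglers (unknown backends get weight zero)
	}
	for _, s := range backendStates {
		if s == "unknown" {
			return false // Phase 2: wait until every referenced backend has resolved
		}
	}
	return true
}
```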

## Controlplane CLI: `maglevc`

`maglevc` connects to a running `maglevd` over gRPC and either executes a single command or drops
into an interactive shell. The same command tree is available in both modes:

```sh
pim@summer:~$ maglevc show frontends
pim@summer:~$ maglevc show backends nginx0-nlams0
pim@summer:~$ maglevc --color=false show vpp lb state
pim@summer:~$ maglevc --server chbtl2.net.ipng.ch:9090 watch events log level debug backend
```

In interactive mode, the prompt is `maglev> `. I put real effort into the shell experience because
this is the tool I reach for constantly when I want to interact with the system. I'm inspired by
Bird and try to mimic its look and feel, which will come in handy as IPng Networks uses Bird in
our routing controlplane. Having these tools all look and feel the same really helps, especially
when fecal matter hits the fast-spinning cooling device.

### Command Tree and Completion

The CLI is built around a tree of command nodes. Each node carries a short description used for
inline help, a list of fixed keyword children, and optionally a live-completion function that
fetches candidates from the runtime state when the _tab_ key is pressed. For backend names, the
completion function calls `ListBackends` with a one-second timeout; for frontend names,
`ListFrontends`; and so on. Unambiguous prefixes complete in place; multiple matches are listed so
I know what to type next. I first saw this trick in the SR Linux command-line interface, and I like
the in-line completion logic a lot. As the Dutch would say, 'Beter goed gestolen dan slecht bedacht'
(better stolen well than invented badly).
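
The node structure behind this is roughly the following - a simplified sketch rather than
`maglevc`'s exact types:

```go
package cli

import (
	"context"
	"strings"
)

// Node is one keyword in the command tree. Completion is optional; when set,
// it fetches live candidates (backend names, frontend names, ...) from maglevd.
type Node struct {
	Keyword    string
	Help       string
	Children   []*Node
	Completion func(ctx context.Context, prefix string) []string
	Run        func(args []string) error
}

// matchChildren returns the children whose keyword starts with the given
// prefix: a single match can be expanded in place, multiple matches are
// listed for the user to choose from.
func matchChildren(n *Node, prefix string) []*Node {
	var out []*Node
	for _, c := range n.Children {
		if strings.HasPrefix(c.Keyword, prefix) {
			out = append(out, c)
		}
	}
	return out
}
```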

**Prefix matching** means I never have to type the full command. `sh ba nginx0` is equivalent to
`show backends nginx0`, and `sh vpp l s` expands to `show vpp lb state`. This was important to me
because I am often working in a hurry and do not want to type long commands.

**Inline help** via `?` will print the available completions for the current cursor position with
a short description next to each keyword. The `?` character is not consumed - the input line is
unchanged after the help display, which is identical to how Bird handles `?`.

**Color mode** defaults to on in the interactive shell and off in one-shot mode, so piped output is
always clean. You can override either default with `--color=true` or `--color=false`. This is of
course not necessary, but it is sometimes helpful to see the difference between static tokens and
variable nouns in the output. I like it, anyway :)
### Viewing State

The most frequently used commands are the `show` family. `show backends <name>` shows the current
state, the enabled flag, the healthcheck, and the recent transition history with timestamps:

```
maglev> show backends nginx0-chplo0
name         nginx0-chplo0
address      2001:678:d78:7::2:0
state        up for 5d19h23m35s
enabled      true
healthcheck  nginx
transitions  down → up           2026-04-24 18:19:51.608   5d19h23m35s ago
             up → down           2026-04-23 22:14:48.311   6d15h28m39s ago
             unknown → up        2026-04-22 09:44:31.664   8d3h58m55s ago
             disabled → unknown  2026-04-22 09:44:30.628   8d3h58m56s ago
             up → disabled       2026-04-22 09:41:54.495   8d4h1m33s ago
```
`show frontends <name>` shows both the configured weight and the effective weight for every backend
in every pool. The effective weight is what was actually programmed into VPP after pool failover
logic:

```
maglev> show frontends nginx-ip6-https
name           nginx-ip6-https
address        2001:678:d78::1:0:1
protocol       tcp
port           443
src-ip-sticky  false
flush-on-down  true
description    IPv6 HTTPS VIP - nginx backends
pools
  name      primary
  backends  nginx0-chlzn0  weight 100  effective 100
            nginx0-chplo0  weight 100  effective 0  [disabled]
  name      secondary
  backends  nginx0-nlams0  weight 100  effective 0
            nginx0-frggh0  weight 100  effective 0
```
Here, I brought `nginx0-chplo0` down so its effective weight is zero; the two instances
`nginx0-nlams0` and `nginx0-frggh0` are in the secondary pool, which is inactive because the primary
pool still has `nginx0-chlzn0` up and serving (all) the traffic.

### VPP State - A Separate Concern

One design decision I am happy with is keeping the `maglevd` view of the world (frontend and backend
state, health counters, effective weights) completely separate from the VPP view (what is actually
programmed in the dataplane). Both are visible through `maglevc`, but through different commands:

```
maglev> show frontends        # maglevd's view: pools, backends, effective weights
maglev> show vpp lb state     # VPP's view: VIPs, AS addresses, bucket counts
maglev> show vpp lb counters  # VPP's view: per-VIP packet/byte counters
```

The `show vpp lb state` command shows the VPP load-balancer state as the plugin sees it: each VIP
with its application servers, their VPP-side weights, and how many of the 1024 Maglev hash buckets
are assigned to each AS. This is invaluable for confirming that a sync operation actually reached
VPP, and for debugging bucket distribution across backends with different weights.

### Operator Actions

The `set` commands drive mutations. `set backend <name> pause` stops the probe goroutine and drives
the effective weight to zero; `set backend <name> disable` does the same but also flushes existing
flows. `set backend <name> resume` and `set backend <name> enable` restart probing and recompute
effective weights when the backend is ready to serve again.

Weight changes are immediate:

```
maglev> set frontend nginx-ip6-https pool primary backend nginx0-chplo0 weight 0
maglev> set frontend nginx-ip6-https pool primary backend nginx0-chplo0 weight 0 flush
maglev> set backend nginx0-chplo0 disable
```

The first command gracefully drains `nginx0-chplo0` from the pool `primary` in frontend
`nginx-ip6-https`. When setting the weight to zero, new flows go elsewhere but existing ones finish.
The second flushes existing flows immediately. The third command then marks the backend as disabled,
which will remove it from serving in all pools it's a member of. This is useful when performing
maintenance on a backend, and it's the command I ran in the 'show frontends' output above.

Arguably the coolest idea, `maglevc watch events`, streams everything in real time. Combined with
`log level debug`, it shows every probe attempt and every VPP API call as they happen:
```
maglev> watch events log level debug backend
{"backend":{"backendName":"nginx0-chlzn0","transition":{"from":"up","to":"up"}}}
{"backend":{"backendName":"nginx0-chplo0","transition":{"from":"up","to":"up"}}}
{"backend":{"backendName":"nginx0-frggh0","transition":{"from":"up","to":"up"}}}
{"backend":{"backendName":"nginx0-nlams0","transition":{"from":"up","to":"up"}}}
{"log":{"atUnixNs":"1777558154335278835","level":"DEBUG","msg":"probe-start",
  "attrs":[{"key":"backend","value":"nginx0-chplo0"},{"key":"type","value":"https"}]}}
{"log":{"atUnixNs":"1777558154371619020","level":"DEBUG","msg":"probe-done",
  "attrs":[{"key":"backend","value":"nginx0-chplo0"},{"key":"type","value":"https"},
  {"key":"ok","value":"true"},{"key":"code","value":"L7OK"},{"key":"detail"},
  {"key":"elapsed","value":"36ms"}]}}
```
And finally, I mimic Bird's "reconfigure" with two primitives, `config check` and `config reload`,
which let me validate and apply configuration changes without restarting the daemon. With that, the
maglev daemon, the main brains of the operation, is feature complete.

## Test Utility: `maglevt`

Once `maglevd` is running and `maglevc` shows everything healthy, the natural next question is: does
it actually work end-to-end? A healthcheck passing means the backend can accept a TCP connection
or return an HTTP 200, but it does not tell me whether a client hitting the VIP actually reaches the
right backend, or whether failover is visible at the application level.

I wanted a tool that could sit outside the control plane entirely - not talking gRPC, not reading
`maglevd` state - but just hitting the VIPs directly as a real client would, tallying which backend
served each request. The obvious approach is to configure each backend to include its own hostname
in an HTTP response header. On my nginx servers I add a header `X-IPng-Frontend` which returns the
local `$hostname` variable. Then a probe tool that reads `X-IPng-Frontend` from each response can
show the live distribution across backends, and a failover is immediately visible as a
redistribution of the tally.
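
The core of that idea fits in a few lines of Go. This is a simplified stand-in for what `maglevt`
does, using the VIP and healthcheck path from the example configuration earlier:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// One fresh connection per request, so each probe is hashed independently.
	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true},
	}

	tally := map[string]int{}
	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://192.0.2.1/.well-known/ipng/healthz")
		if err != nil {
			tally["error"]++
			continue
		}
		resp.Body.Close()
		// The backend identifies itself in the X-IPng-Frontend response header.
		tally[resp.Header.Get("X-IPng-Frontend")]++
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println(tally)
}
```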

That idea turns into `maglevt`, which reads one or more `maglev.yaml` files, enumerates the
HTTP/HTTPS frontends, and probes each VIP at a configurable interval (default 100ms per VIP, with
+/-10% jitter to prevent phase-locking). Each probe opens a fresh TCP connection - keep-alives are
off by default - so every request is independently hashed by VPP's Maglev algorithm. The tally
reshuffles the moment a backend goes down or a standby pool activates.

The UI is a terminal dashboard built with [[Bubble Tea](https://github.com/charmbracelet/bubbletea)],
a Go TUI library. Each VIP gets a tile showing a rolling latency summary (min, max, average, p95),
running success and failure counts, the response header tally, and a list of recent errors, like so:

{{< image width="100%" src="/assets/vpp-maglev/maglevt.png" alt="VPP Maglev TUI client" >}}

There's a lot to see in this screenshot, so let me unpack it. I'm running `maglevt` on a machine at
AS12859, BIT in the Netherlands, called `nlede01.paphosting.net`. It's reaching the VIPs that are
announced in Amsterdam, the Netherlands (`vip0.l.ipng.ch`) and Lille, France (`vip1.l.ipng.ch`), and
it is doing so over both IPv4 and IPv6, on both port 80 and 443, which yields eight targets. The
webservers are configured to respond with an empty HTTP 204 response, and I've replayed about a
million requests to each VIP. A few of these failed, which was mostly me playing around with
backend drains/flushes, hostile shutdowns (rebooting an nginx), and VIP failover scenarios. Then,
each VIP shows its last 100 probes in terms of latency, latency tail, and success rate.

In the second section, the tool shows how many times a response carried a certain value in the
`X-IPng-Frontend` header. The greyed-out entries are values which have not been seen in the last
five seconds, the white ones are current: each row has exactly one bright white entry, which shows
that this client is consistently hashed to one frontend at a time - this particular test is using
HTTP keepalive.

In the bottom section, a list of recent events is shown - mostly moments when the latency ceiling is
hit. These 'spikes' are written in bright yellow, and if things like timeouts occur, they are
written in bright red.

{{< image width="4em" float="left" src="/assets/vpp-maglev/Claude_AI.svg" alt="Claude Code" >}}

I have to be honest here: before this project I had never written a Terminal UI in my life. The
Bubble Tea documentation is good but the model - a pure functional message-passing loop - took me
a while to internalize. I ended up leaning on Claude quite a bit to get the layout right, especially
the live-updating cells and the latency histogram accumulation.

What I found was that I could describe what I wanted in plain language and the code that came back
was usually correct and idiomatic. I then spent time reading and understanding the code before
committing it. I learned a lot about how Go handles terminal output and about the Elm architecture
that Bubble Tea is based on - much faster than I would have on my own. Having an AI collaborator
that writes correct code does not mean I can stop learning; if anything, having working code in
front of me makes the learning faster!

## Frontend: GUI `maglevd-frontend`
Now that I'm in "yes, I vibe"-admission-mode, there's another type of component I've rarely if ever
worked on: web frontends! `maglevd-frontend` is a single Go binary with a
[[SolidJS](https://www.solidjs.com/)] single-page app embedded at build time via `//go:embed` - no
runtime file dependencies, no Node.js required after the build. Simple and standalone.

One design goal I set early was to be able to observe all my load balancer instances from a single
dashboard. `maglevd-frontend` connects to one or more `maglevd` instances, which are passed via the
`--server` flag at startup.

At the top of the page, I add a **scope selector**: one pill per configured `maglevd`, colored green
when the frontend's gRPC channel to that instance is alive and red when it cannot connect. Clicking
a pill switches the entire view to that site's frontends. I notice that reloading the page resets
the selection, so I add a cookie to persist it across page reloads.
### Frontend: Live Event Streaming

I learn about Server-Sent Events (SSE): `maglevd-frontend` subscribes to `WatchEvents` on each
configured `maglevd` and translates the gRPC stream into SSE events on the `/view/api/events`
endpoint. The browser's EventSource API reconnects automatically on disconnect, and the server
maintains a 30-second / 2000-event ring buffer so that a page reload replays recent events using
`Last-Event-ID`. I'm pleased with the result: a dashboard that stays current in real time with no
polling and visible catch-up after a brief disconnect, like a laptop lid close.
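
The gRPC-to-SSE bridge is conceptually small. Here is a trimmed-down sketch of the handler side
(no ring buffer or `Last-Event-ID` replay shown), just to illustrate the mechanism:

```go
package web

import (
	"fmt"
	"net/http"
)

// eventsHandler fans a channel of already-JSON-encoded events out to one
// browser as a Server-Sent Events stream.
func eventsHandler(events <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		id := 0
		for {
			select {
			case <-r.Context().Done():
				return // browser went away (or a laptop lid was closed)
			case ev := <-events:
				id++
				// The id: line is what lets EventSource resume via Last-Event-ID later.
				fmt.Fprintf(w, "id: %d\ndata: %s\n\n", id, ev)
				flusher.Flush()
			}
		}
	}
}
```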

When a backend transitions from `up` to `down`, the badge in the frontend card updates within
milliseconds. A pool failover - where the primary pool empties and the fallback pool activates -
appears as a cascade of state changes followed by a re-rendering of the effective weight column. The
LB buckets column (showing VPP's actual hash table allocation for each AS) is refreshed via a
debounced `GetVPPLBState` scrape on every transition, at most once per second per `maglevd`. And
looking at this frontend, it may be clear to you why I designed the backend to have a subscribable
event stream:

{{< image width="100%" src="/assets/vpp-maglev/maglev-frontend.png" alt="VPP Maglev Frontend" >}}

The tech stack for the Single Page App is [[SolidJS](https://www.solidjs.com/)], a super cool reactive
framework that compiles away its virtual DOM and produces small, fast bundles. I chose it over React
partly because I was curious about it and partly because the bundle size matters when you are
embedding the whole thing in a Go binary. The event store is a simple Solid signal that the SSE
handler updates; every component that cares re-renders automatically without explicit subscription
management. It's slick and much easier to use than I had initially thought!
### Frontend: Admin Surface

When both `MAGLEV_FRONTEND_USER` and `MAGLEV_FRONTEND_PASSWORD` environment variables are set, the
admin surface is activated at `/admin/`. I make sure that without credentials, `/admin/` returns
404. In this case, the admin path is not just unprotected, it is entirely absent. Security matters,
at least a little bit, even if the frontend will not be exposed onto the Internet.

In admin mode, every backend row grows a `⋮` (kebab) menu with `pause`, `resume`, `enable`,
`disable`, and `set weight` entries. Lifecycle actions open a confirmation dialog that spells out the
dataplane consequence: `disable` explicitly warns that it will drop live sessions via the flow-table
flush. The weight dialog has a 0-100 slider and a `flush existing flows` checkbox - unchecked is the
graceful drain, checked is the immediate session-drop path.

Also in admin mode, a **Debug panel** at the bottom of the page tails every event the SPA has seen
across all `maglevd` instances: backend and frontend transitions, log lines, VPP LB sync events, and
connection status flips, all formatted for scanning. A scope filter narrows the tail to the current
`maglevd`; an `all maglevds` checkbox enables firehose mode; a `pause` button freezes the tail so
you can read back.
## Results

I rolled this out at IPng Networks a few weeks ago, and it's been running rock solid ever since.
I've taken four VPP machines, connected them to the core routers, and started to announce two VIPs,
each announced in two cities. `vip0` is announced from Zurich (Switzerland) and Amsterdam (the
Netherlands), and `vip1` is announced from Lucerne (Switzerland) and Lille (France). I've moved over
most websites, as I find putting skin in the game is important:

```
pim@summer:~$ host ipng.ch
ipng.ch has address 194.1.163.31
ipng.ch has address 194.126.235.31
ipng.ch has IPv6 address 2001:678:d78::1:0:1
ipng.ch has IPv6 address 2a0b:dd80::1:0:1
```
The only service I'm a bit apprehensive about - even though I don't think I need to be - is the
[[Static CT Logs](/s/ct/)], which do about 2.5kqps of reads and 400qps of writes at the moment. The
plan is to let this marinate for a few weeks, and then move the read-path and, later on, also the
write-path to this construction.

You can find the project at [[git.ipng.ch/ipng/vpp-maglev](https://git.ipng.ch/ipng/vpp-maglev.git)]
and Debian packages are on [[deb.ipng.ch](https://deb.ipng.ch/)]. I wrote some reasonable
documentation for the project at:

* [[docs/design.md](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs/design.md)] covers the
  architecture, components, and numbered functional / non-functional requirements. Start here if
  you want the big picture before diving into the code.
* [[docs/user-guide.md](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs/user-guide.md)]
  describes the flags, signals, and maglevc command reference.
* [[docs/config-guide.md](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs/config-guide.md)]
  shows the full YAML configuration file reference.
* [[docs/healthchecks.md](https://git.ipng.ch/ipng/vpp-maglev/src/branch/main/docs/healthchecks.md)]
  is a deepdive on the health state machine, probe scheduling, and rise/fall semantics.

## What's Next

Using Maglev has a few significant benefits. Most importantly, I can drain (or weather an outage of)
any nginx frontend within seconds, and there is no more DNS propagation delay. Another key property
is that the loadbalanced VIPs themselves are now completely mobile, and anycasted. I can drain a VPP
loadbalancer by simply removing its announcement of the VIPs, and anycast routing will seamlessly
move the traffic to another live replica. This immunizes IPng against site / datacenter / machine
failures as well, as rerouting happens within only a few seconds.

However, there are also a few smaller downsides. Notably, this setup is more complex than merely
having "the webserver": there are now half a dozen webservers, and potentially half a dozen places
where traffic can enter the system, which poses a challenge for observability. In an upcoming
article, I'll spend some time thinking through how to make that as easy as possible, with Prometheus
and Grafana dashboards, as well as a clever trick to be able to see which Maglev loadbalancer sent
which request to which IPng nginx Frontend. If this type of thing is interesting to you, stay tuned!