The core problem is that VPP's `lb` plugin is pure dataplane. It holds a table of VIPs, each with a
set of application servers and, using the feature I added, their weights. It then hashes new flows
deterministically onto those servers. That is cool, but it is all it does. If a backend stops
responding, VPP does not know and does not care - it will keep sending traffic to that address until
someone or something tells it otherwise. The result is a black hole: clients trying to establish new
connections time out while waiting for a backend that will never respond.

Before blindly writing code, I wrote down a few of the constraints I wanted to hold true. Wait, a
design, you say? Well, yes! And this design turned out to drive most of the architectural decisions:

**One source of truth.** Every component - CLI, web dashboard, alerting scripts - reads `maglevd`
through one typed gRPC interface. There is no secondary control plane. The CLI and the web dashboard
show exactly the same state as each other because they both ask the same controlplane daemon.

A **healthcheck** defines how to probe - the protocol, port, success criteria, timing parameters,
and so on. A **backend** is a named IP address bound to a healthcheck. A **frontend** is a VIP
address with one or more named **pools**, where each pool is an ordered list of `(backend, weight)`
tuples. At runtime, each backend gets exactly one probe (which Go lets me use goroutines for),

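To make that data model concrete, here is a minimal Go sketch of how these objects could be
represented - the type and field names are my own illustration, not necessarily how maglevd defines
them:

```go
// A Go sketch of the configuration model described above; type and field names
// are my own illustration, not necessarily how maglevd defines them.
package config

import "time"

// Healthcheck defines how to probe: protocol, port, success criteria, timing.
type Healthcheck struct {
	Protocol string        `yaml:"protocol"` // icmp, tcp, http or https
	Port     int           `yaml:"port"`
	Interval time.Duration `yaml:"interval"`
	Timeout  time.Duration `yaml:"timeout"`
	Rise     int           `yaml:"rise"` // consecutive passes to become up
	Fall     int           `yaml:"fall"` // consecutive failures to become down
}

// Backend is a named IP address bound to a healthcheck.
type Backend struct {
	Address     string `yaml:"address"`
	Healthcheck string `yaml:"healthcheck"`
}

// PoolEntry is one (backend, weight) tuple; a pool is an ordered list of them.
type PoolEntry struct {
	Backend string `yaml:"backend"`
	Weight  uint32 `yaml:"weight"`
}

// Frontend is a VIP address with one or more named pools.
type Frontend struct {
	VIP   string                 `yaml:"vip"`
	Pools map[string][]PoolEntry `yaml:"pools"`
}
```
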
On each probe, a pass increments the counter (ceiling at maximum); a failure decrements it (floor
at zero). This gives **hysteresis**: a backend sitting at the rise boundary needs `fall`
consecutive failures before it transitions to down, and a fully-down backend needs `rise`
consecutive passes to come back up. A flapping backend that alternates between passing and failing
stays in the degraded zone without bouncing between states - which is exactly what I want, to
avoid a storm of VPP API calls from a noisy backend.

In _pseudocode_, here's what that simple yet elegant approach looks like:

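A rough Go rendering of that rise/fall counter - my own sketch, with thresholds that interpret the
description above rather than quote maglevd's actual code - could look like this:

```go
// My own sketch of the rise/fall hysteresis counter described above; the
// thresholds are an interpretation, not maglevd's actual implementation.
package health

type State int

const (
	Unknown State = iota
	Up
	Down
)

type Counter struct {
	rise, fall int // consecutive passes / failures needed to transition
	score      int // clamped to [0, rise+fall]
	state      State
}

func NewCounter(rise, fall int) *Counter {
	return &Counter{rise: rise, fall: fall, state: Unknown}
}

// Observe records one probe result and returns the (possibly unchanged) state.
func (c *Counter) Observe(pass bool) State {
	ceiling := c.rise + c.fall
	if pass {
		if c.score < ceiling {
			c.score++ // ceiling at maximum
		}
	} else if c.score > 0 {
		c.score-- // floor at zero
	}

	switch {
	case c.score >= c.rise:
		c.state = Up // a fully-down backend needs `rise` passes to get here
	case c.score <= c.rise-c.fall || c.score == 0:
		c.state = Down // from the rise boundary, `fall` failures land here
	default:
		// In between, the previous state sticks: a flapping backend stays put
		// instead of generating a storm of VPP API calls.
	}
	return c.state
}
```
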
**Probe types.** `maglevd` starts off its life supporting four probe types (a sketch of the `http`
probe follows the list):

- **`icmp`** - sends an ICMP echo request and waits for a reply, for which I do not need to run the
daemon with root privileges; instead, I can assign `CAP_NET_RAW` for this purpose. This healthcheck
type is useful for checking basic reachability without opening a TCP connection. Borrowing again
from HAProxy, this can result in probe codes: `L4OK` on reply, `L4TOUT` on timeout, `L4CON` on send
error.
- **`tcp`** - opens a TCP connection to the configured port and closes it cleanly. This healthcheck
can optionally wrap the connection in TLS with parameter `ssl: true`, with optional server name and
`insecure-skip-verify` to allow for self-signed certificates. The resulting probe codes are `L4OK`
on connect, `L4CON` on refused, `L4TOUT` on timeout, `L6OK`/`L6CON`/`L6TOUT` for TLS.
- **`http`** - opens a TCP connection, sends an HTTP/1.1 `GET` request to the configured path with
an optional `Host` header, and validates the response code against a configured range (e.g.
`"200-204"`). This healthcheck can optionally validate the body against a regular expression, making
it similar to how Nagios does its checks. The probe return codes are: `L7OK` on success, `L7STS` on
unexpected status code, `L7RSP` on bad response, and `L7TOUT` on timeout.
- **`https`** - this is a special case of the `http` healthcheck type, but using TLS. It supports
the use of SNI `server-name` override and `insecure-skip-verify` as well for backends with
self-signed certificates.

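As promised above, here is a hedged Go sketch of what the `http` probe could look like. The
function name, parameters and the mapping of errors onto probe codes are illustrative; the success
range is assumed to be parsed into `okMin`/`okMax` already, and connection-level errors are
simplified:

```go
// A sketch of the `http` probe; names and error handling are illustrative,
// not maglevd's actual code.
package probe

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

func ProbeHTTP(addr, path, host string, okMin, okMax int, bodyRE *regexp.Regexp, timeout time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fmt.Sprintf("http://%s%s", addr, path), nil)
	if err != nil {
		return "L7RSP"
	}
	if host != "" {
		req.Host = host // optional Host header override
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return "L7TOUT" // timed out
		}
		return "L7RSP" // treat other failures as a bad response in this sketch
	}
	defer resp.Body.Close()

	if resp.StatusCode < okMin || resp.StatusCode > okMax {
		return "L7STS" // unexpected status code
	}
	if bodyRE != nil {
		body, err := io.ReadAll(resp.Body)
		if err != nil || !bodyRE.Match(body) {
			return "L7RSP" // bad or non-matching response body
		}
	}
	return "L7OK"
}
```
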
The most powerful RPC I add is called `WatchEvents`. This one returns a streaming response, and a
client can initiate a `WatchRequest` which specifies which event types to include. The `vpp-maglev`
daemon then pushes events as they happen - there is no polling. The event envelope is a protobuf
`oneof`.

I tried to keep the reconciler as simple as possible. It only subscribes to the healthchecker's
event channel and, for every backend transition, calls `SyncLBStateVIP` for the affected frontend. To
catch drift in the VPP dataplane, for example if VPP restarted or if we re-connected to VPP, a
periodic `SyncLBStateAll` also runs and sweeps up any changes. This should not occur in general
operation, though; it's a belt-and-suspenders type of thing.

This isolated `SyncLBState*` stuff is also a future hook for divorcing the healthchecker and the LB
reconciler into two different binaries: think of a datacenter with 100 maglev frontends and 1000
backends - it would be wasteful to have every maglev check every backend!

Otherwise, the reconciler carries no state of its own. I put all the logic in `SyncLBStateVIP`,
which computes the full desired state from the config and current health, diffs it against what VPP
has, and issues only the necessary Binary API calls to bring the two in sync.

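A sketch of that diff-and-apply shape in Go - the `VPPClient` interface and its method names are
placeholders for illustration, not the real binary API bindings maglevd uses:

```go
// A sketch of the diff-and-apply shape described above; VPPClient and its
// method names are placeholders, not the real VPP binary API bindings.
package reconciler

type VPPClient interface {
	GetBackends(vip string) (map[string]uint32, error) // address -> weight as VPP sees it
	SetBackendWeight(vip, addr string, weight uint32) error
	RemoveBackend(vip, addr string) error
}

// SyncVIP computes the desired backend set for one VIP and issues only the
// calls needed to make VPP match it.
func SyncVIP(vpp VPPClient, vip string, desired map[string]uint32) error {
	actual, err := vpp.GetBackends(vip)
	if err != nil {
		return err
	}
	// Add or adjust backends that are missing or carry the wrong weight.
	for addr, want := range desired {
		if got, ok := actual[addr]; !ok || got != want {
			if err := vpp.SetBackendWeight(vip, addr, want); err != nil {
				return err
			}
		}
	}
	// Remove backends VPP still has but the desired state no longer contains.
	for addr := range actual {
		if _, ok := desired[addr]; !ok {
			if err := vpp.RemoveBackend(vip, addr); err != nil {
				return err
			}
		}
	}
	return nil
}
```
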
### Dataplane API: Startup Warmup

During one of my tests, I noticed that after restarting maglevd, it completely wipes the VPP
loadbalancer VIPs. In hindsight this makes total sense because when the healthchecker starts, all
backends are in `unknown` state, which causes the weights to be zero until the backends transition
to the `up` state. This causes thrashing in the dataplane, which is not what I intended. I think for
a bit and decide how I'm going to prevent that. My solution is a two-phase startup warmup controlled
by `startup-min-delay` (default 5s) and `startup-max-delay` (default 30s):

**Phase 1: hands-off window.** For the first `startup-min-delay` seconds after maglevd starts,
neither the reconciler nor the periodic sync loop can touch VPP at all. Probes run, the checker
accumulates state, but nothing is programmed into the dataplane yet.

Whichever wins the race performs a single `SyncLBStateVIP` for that VIP. It is free to live its
life.

**Watchdog.** At `startup-max-delay`, any VIP whose backends are still `unknown` is swept by a
final `SyncLBStateAll`. Those stragglers are programmed with weight zero: something is still wrong
with them, but this is an unlikely situation, and one of those belt-and-suspenders things again.

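A minimal sketch of that two-phase gate, under the assumption that it boils down to a time-based
check plus a one-shot watchdog timer; the real interplay with per-VIP readiness is more involved:

```go
// A minimal sketch of the startup warmup gate: a hands-off window before
// startup-min-delay, and a watchdog that fires at startup-max-delay. This is
// an assumption about the shape of the logic, not maglevd's actual code.
package warmup

import "time"

type Gate struct {
	started  time.Time
	minDelay time.Duration // hands-off window: no VPP writes at all
	maxDelay time.Duration // watchdog: force a full sync at this point
}

func New(minDelay, maxDelay time.Duration) *Gate {
	return &Gate{started: time.Now(), minDelay: minDelay, maxDelay: maxDelay}
}

// MayTouchVPP reports whether the reconciler is allowed to program VPP yet.
func (g *Gate) MayTouchVPP() bool {
	return time.Since(g.started) >= g.minDelay
}

// Watchdog returns a timer channel that fires once at startup-max-delay, at
// which point the caller runs a final SyncLBStateAll to sweep up stragglers.
func (g *Gate) Watchdog() <-chan time.Time {
	return time.After(g.maxDelay - time.Since(g.started))
}
```
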
## Controlplane CLI: `maglevc`

In interactive mode, the prompt is `maglev> `. I put real effort into the shell experience because
this is the tool I reach for constantly when I want to interact with the system. I'm inspired by
Bird and try to mimic its look and feel, which will come in handy as IPng Networks uses Bird in
our routing controlplane. Having these tools all look and feel the same really helps, especially
when fecal matter hits the fast-spinning cooling device.

I saw this trick first in the SR Linux command-line interface, and I like its in-line completion
logic a lot. As the Dutch would say, 'Beter goed gestolen dan slecht bedacht' - better stolen well
than invented poorly.

**Prefix matching** means I never have to type the full command. `sh ba nginx0` is equivalent to
`show backends nginx0`, and `sh vpp l s` expands to `show vpp lb state`. This was important to me
because I am often working in a hurry and do not want to type long commands.

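The prefix-matching idea is simple enough to sketch in a few lines of Go; the helper name and the
notion of a per-position keyword list are invented for illustration:

```go
// A sketch of prefix matching: each input token may be any unambiguous prefix
// of a keyword valid at that position. Keyword tables here are illustrative.
package cli

import "strings"

// expand resolves one abbreviated token against the keywords valid at this
// position; it returns the full keyword, or "" if the prefix is ambiguous or
// matches nothing.
func expand(token string, keywords []string) string {
	var match string
	for _, kw := range keywords {
		if kw == token {
			return kw // exact match always wins
		}
		if strings.HasPrefix(kw, token) {
			if match != "" {
				return "" // ambiguous: two keywords share this prefix
			}
			match = kw
		}
	}
	return match
}
```

Applied token by token against the keywords valid at each position, `sh vpp l s` resolves to
`show vpp lb state`, as in the example above, so long as each prefix stays unambiguous.
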
**Inline help** via `?` will print the available completions for the current cursor position with
a short description next to each keyword. The `?` character is not consumed - the input line is
unchanged after the help display, which is identical to how Bird consumes `?` characters.

**Color mode** defaults to on in the interactive shell and off in one-shot mode, so piped output is
free of color escape codes.

    protocol tcp
    port 443
    src-ip-sticky false
    flush-on-down true
    description IPv6 HTTPS VIP - nginx backends
    pools
    name primary
    backends nginx0-chlzn0 weight 100 effective 100

Here, I brought `nginx0-chplo0` down so its effective weight is zero; the two inactive backends
`nginx0-nlams0` and `nginx0-frggh0` are in the secondary pool, which is inactive because the primary
pool still has `nginx0-chlzn0` up and serving (all) the traffic.

### VPP State - A Separate Concern

One design decision I am happy with is keeping the `maglevd` view of the world (frontend and backend
state, health counters, effective weights) completely separate from the VPP view (what is actually
programmed in VPP).

The second flushes existing flows immediately. The third command then marks the backend as down,
which will remove it from serving in all pools it's a member of. This is useful when performing
maintenance on a backend, and it's the command I ran in the 'show frontend' output above.

Arguably the coolest idea, `maglevc watch events`, streams everything in real time. Combined with
`log level debug`, it shows every probe attempt and every VPP API call as they happen:

```
maglev> watch events log level debug backend
{"key":"elapsed","value":"36ms"}]}}
```

And finally, I mimic Bird's "reconfigure" with a set of two primitives, `config check` and `config
reload`, which let me validate and apply configuration changes without restarting the daemon. With
that, the maglev daemon, the main brains of the operation, is feature complete.

Once `maglevd` is running and `maglevc` shows everything healthy, the natural next question is: does
it actually work end-to-end? A healthcheck passing means the backend can accept a TCP connection
or return an HTTP 200, but it does not tell me whether a client hitting the VIP actually reaches the
right backend, or whether failover is visible at the application level.

I wanted a tool that could sit outside the control plane entirely - not talking gRPC, not reading
`maglevd` state - but just hitting the VIPs directly as a real client would, tallying which backend
served each request. The obvious approach is to configure each backend to include its own hostname
in an HTTP response header. On my nginx servers I add a header `X-IPng-Frontend` which returns the
local `$hostname` variable. Then a probe tool that reads `X-IPng-Frontend` from each response can
show the live distribution across backends, and a failover is immediately visible as a
redistribution of the tally.

That idea turns into `maglevt`, which reads one or more `maglev.yaml` files, enumerates the
HTTP/HTTPS frontends, and probes each VIP at a configurable interval (default 100ms per VIP, with
+/-10% jitter to prevent phase-locking). Each probe opens a fresh TCP connection - keep-alives are
off by default - so every request is independently hashed by VPP's Maglev algorithm. The tally
reshuffles the moment a backend goes down or a standby pool activates.

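A stripped-down sketch of that probing idea in Go - one VIP, no jitter, keep-alives disabled so
every request opens a fresh connection (and therefore gets a fresh Maglev hash), and a tally keyed
on the `X-IPng-Frontend` header. The real tool fans this out across all enumerated VIPs:

```go
// A toy version of the maglevt probe loop; maglevt itself does much more
// (per-VIP fan-out, jitter, latency histograms, the TUI).
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true}, // fresh TCP conn per request
	}
	tally := map[string]int{}

	for range time.Tick(100 * time.Millisecond) { // default interval per VIP
		resp, err := client.Get("https://vip0.l.ipng.ch/")
		if err != nil {
			tally["error"]++
			continue
		}
		backend := resp.Header.Get("X-IPng-Frontend") // which nginx served this?
		resp.Body.Close()
		tally[backend]++
		fmt.Println(tally)
	}
}
```
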
{{< image width="100%" src="/assets/vpp-maglev/maglevt.png" alt="VPP Maglev TUI client" >}}

There's a lot to see in this screenshot, so let me unpack it. I'm running `maglevt` on a machine at
AS12859, BIT in the Netherlands, called `nlede01.paphosting.net`. It's reaching the VIPs that are
announced in Amsterdam, the Netherlands (`vip0.l.ipng.ch`) and Lille, France (`vip1.l.ipng.ch`), and
it is doing so with both IPv4 and IPv6, on both ports 80 and 443, which yields eight
targets. The webservers are configured to respond with an empty HTTP 204 response, and I've replayed
about 1 million requests to each VIP. A few of these failed, which was mostly me playing around with
backend drains/flushes, hostile shutdowns (rebooting an nginx), and VIP failovers. In the first
section, each VIP shows its last 100 probes in terms of latency, latency tail, and success rate.

In the second section, the tool is just showing how many times a response had a certain HTTP header
in it. The greyed out ones are values which have not been seen in five seconds; the white ones are
current. It shows that I'm consistently hashing this client to one frontend at a time (because each
row has exactly one bright white entry). This test is using HTTP keepalive.

In the bottom section, a list of recent events is shown - mostly moments when the latency ceiling
is hit. These 'spikes' are written in bright yellow; if things like timeouts occur, they are written
in bright red.

{{< image width="4em" float="left" src="/assets/vpp-maglev/Claude_AI.svg" alt="Claude Code" >}}

I have to be honest here: before this project I had never written a Terminal UI in my life. The
Bubble Tea documentation is good but the model - a pure functional message-passing loop - took me
a while to internalize. I ended up leaning on Claude quite a bit to get the layout right, especially
the live-updating cells and the latency histogram accumulation.

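For readers who have not seen Bubble Tea, here is a toy example of the Elm-style loop it is built
around - a model, a pure `Update` that reacts to messages, and a `View` that renders it. This is
not maglevt's actual code:

```go
// A minimal Bubble Tea program: model -> Update(msg) -> View(), repeated.
package main

import (
	"fmt"
	"os"

	tea "github.com/charmbracelet/bubbletea"
)

type model struct{ probes int }

func (m model) Init() tea.Cmd { return nil }

// Update is a pure function: it takes the current model and a message, and
// returns the next model (plus an optional command).
func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
	switch msg := msg.(type) {
	case tea.KeyMsg:
		switch msg.String() {
		case "q", "ctrl+c":
			return m, tea.Quit
		default:
			m.probes++ // any other key: pretend a probe completed
		}
	}
	return m, nil
}

func (m model) View() string {
	return fmt.Sprintf("probes seen: %d (press q to quit)\n", m.probes)
}

func main() {
	if _, err := tea.NewProgram(model{}).Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```
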
What I found was that I could describe what I wanted in plain language and the code that came back
was usually correct and idiomatic. I then spent time reading and understanding the code before
committing it. I learned a lot about how Go handles terminal output and about the Elm architecture
that Bubble Tea is based on - much faster than I would have on my own. Having an AI collaborator
that writes correct code does not mean I can stop learning; if anything, having working code in
front of me makes the learning faster!

## Frontend: GUI `maglevd-frontend`

Now that I'm in "yes, I vibe"-admission-mode, there's another type of component I've rarely if ever
worked on: web frontends! `maglevd-frontend` is a single Go binary with a
[[SolidJS](https://www.solidjs.com/)] single-page app embedded at build time via `//go:embed` - no
runtime file dependencies, no Node.js required after the build. Simple and standalone.

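The single-binary trick is standard Go; a minimal sketch, assuming the built SPA lands in a `dist/`
directory, looks like this:

```go
// A sketch of embedding a built SPA into the binary with //go:embed and
// serving it from memory; the "dist" directory name is an assumption.
package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

//go:embed dist
var spa embed.FS

func main() {
	// Strip the "dist/" prefix so index.html is served at "/".
	sub, err := fs.Sub(spa, "dist")
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/", http.FileServer(http.FS(sub)))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
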
One design goal I set early was to be able to observe all my load balancer instances from a single
place.

The dashboard maintains a 30-second / 2000-event ring buffer, so a page reload replays recent
events without polling, and a brief disconnect - like a laptop lid close - shows a visible catch-up.

When a backend transitions from `up` to `down`, the badge in the frontend card updates within
milliseconds. A pool failover - where the primary pool empties and the fallback pool activates -
appears as a cascade of state changes followed by a re-rendering of the effective weight column. The
LB buckets column (showing VPP's actual hash table allocation for each AS) is refreshed via a
debounced `GetVPPLBState` scrape on every transition, at most once per second per `maglevd`. And
all of it is driven by the same event stream:

{{< image width="100%" src="/assets/vpp-maglev/maglev-frontend.png" alt="VPP Maglev Frontend" >}}

The tech stack for the Single Page App is [[SolidJS](https://www.solidjs.com/)], a super cool reactive
framework that compiles away its virtual DOM and produces small, fast bundles. I chose it over React
partly because I was curious about it and partly because the bundle size matters when you are
embedding the whole thing in a Go binary. The event store is a simple Solid signal that the SSE
handler updates; every component that cares re-renders automatically without explicit subscription
management. It's slick and much easier to use than I had initially thought!

### Frontend: Admin Surface

When both `MAGLEV_FRONTEND_USER` and `MAGLEV_FRONTEND_PASSWORD` environment variables are set, the
admin surface is activated at `/admin/`. I make sure that without credentials, `/admin/` returns
404. In this case, the admin path is not just unprotected, it is entirely absent. Security matters,
at least a little bit, even if the frontend will not be exposed onto the Internet.

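A sketch of that "absent unless configured" idea: only register the `/admin/` routes when both
environment variables are present, so everything else falls through to a 404. The handler wiring
and the use of HTTP basic auth here are my assumptions, not necessarily how `maglevd-frontend`
does it:

```go
// Only register /admin/ when credentials are configured; otherwise the mux
// has no such route and requests 404. Basic auth is an assumption here.
package frontend

import (
	"crypto/subtle"
	"net/http"
	"os"
)

func registerAdmin(mux *http.ServeMux, admin http.Handler) {
	user := os.Getenv("MAGLEV_FRONTEND_USER")
	pass := os.Getenv("MAGLEV_FRONTEND_PASSWORD")
	if user == "" || pass == "" {
		return // no credentials: /admin/ is never registered, so it 404s
	}
	mux.Handle("/admin/", http.StripPrefix("/admin", basicAuth(user, pass, admin)))
}

func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="maglevd-frontend"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```
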
In admin mode, every backend row grows a `⋮` (kebab) menu with `pause`, `resume`, `enable`,
`disable`, and `set weight` entries. Lifecycle actions open a confirmation dialog that spells out the
dataplane consequence: `disable` explicitly warns that it will drop live sessions via the flow-table
flush. The weight dialog has a 0-100 slider and a `flush existing flows` checkbox - unchecked is the
graceful drain, checked is the immediate session-drop path.

Also in admin mode, a **Debug panel** at the bottom of the page tails every event the SPA has seen

## What's Next

Using Maglev has a few significant benefits. Most importantly, I can drain (or weather an outage of)
any nginx frontend within seconds, and there is no more DNS propagation delay. Another key property
is that the loadbalanced VIPs themselves are now completely mobile, and anycasted. I can drain a VPP
loadbalancer by simply removing its announcement of the VIPs, and anycast routing will seamlessly
move the traffic to another live replica. This immunizes IPng from site / datacenter / machine
failures.

Instead of having "the webserver", there are now half a dozen webservers, and potentially half a
dozen places where traffic can enter the system, which poses a challenge with observability. In an
upcoming article, I'll spend some time thinking through how to make it as easy as possible, with
Prometheus and Grafana dashboards, as well as a clever trick to be able to see which Maglev
loadbalancer sent which request to which IPng nginx Frontend. If this type of thing is interesting
to you, stay tuned!