---
date: "2026-04-30T06:35:14Z"
title: VPP with Maglev Loadbalancing - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
Load balancing is one of those topics that sounds deceptively simple until you think about it for a
while. In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs,
and add two small new features that make running it in production a little bit easier.
## Introduction
IPng runs services that want to be reachable via as few public IP addresses as possible. Let's say I
want to run a DNS resolver or authoritative nameserver or even the IPng website, but I want these to
be highly available and perhaps scale to more traffic than one backend server could provide. What
are my options?
My first option is to just put a bunch of servers online, say 7 webservers, give each of them an
A/AAAA record in DNS, and point `ipng.ch` at all of them. It's clumsy: notably, if one server is
down for maintenance or has failed, one seventh of the traffic will still try to reach it. Also,
removing a server from DNS leaves lots of lingering traffic on that webserver, as clients are
sometimes slow to pick up DNS changes, even if my TTL is low.
Let me show you an example:
{{< image width="100%" src="/assets/vpp-maglev/qps-before.png" alt="NGINX qps per instance" >}}
There are two main problems with this graph:
1. ***Load imbalance***: there are seven webservers in this graph, but somehow only three of them
are getting traffic, the others are not. One (`nginx0.chrma0`) is much more heavily loaded than the
others: it's receiving 1.2kqps while the others receive ~40qps. This poses a risk: if the clients
that are somehow attracted to this instance grow in number, they may overwhelm this little
webserver, even though there are six others that could help out!
2. ***Drains take _forever_***: The green graph was a drain of `nginx0.nlams2` due to a pending
maintenance window as the datacenter is closing and the server needs to be physically moved. I put
in the DNS change at around 16:15 UTC and the traffic finally dropped at 21:45, a full five hours
(!) later. And believe it or not, the TTL was 15 minutes on these records. Some clients just don't
get the hint ...
### Load balancing 101
A naive load balancing solution is to simply round-robin: send each new packet to the next backend in the list.
That works reasonably well for stateless UDP traffic like DNS, although even with DNS there is a gotcha: some
DNS queries need TCP, for example those that are too big to fit in a single UDP packet, and they
will not be tolerant of naive packet round robin. For TCP this naive load balancing solution
quickly falls apart, because every packet in a connection needs to reach the *same* backend. Sending
a SYN to backend A and the subsequent ACK to backend B will not establish a TCP connection.
The classical answer is to keep per-session state on the load balancer: a table that maps a 5-tuple of
{source IP, destination IP, source port, destination port, protocol} to a chosen backend. That
works, but it introduces a stateful bottleneck. At line rate on a load balancer handling millions of
flows and packets/sec, maintaining and synchronising that table across multiple CPU threads is expensive. It also means that if the
load balancer restarts, every existing TCP session breaks.
What if there was some form of *consistent hashing*: given the 5-tuple of a packet, the load balancer
might always select the same backend deterministically, without storing any per-session state. If backends come and go, only
the flows that were assigned to the changed backend are affected — all other flows keep working.
Google solved this problem at scale and published their solution. They call it Maglev.
## Introducing Maglev
{{< image width="12em" float="right" src="/assets/vpp-maglev/maglev.png" alt="Icon of a maglev train" >}}
Google's Maglev load balancer has been running in production since 2008 and I happen to know several
of its authors - as a personal aside I was sad to learn that Cody Smith, with whom I shared an office
and a team for many years, passed away earlier this year. Rest in peace, Cody!
The Google team published their design at NSDI 2016 in the paper
[[Maglev: A Fast and Reliable Software Network Load Balancer](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].
It is worth reading in full — the paper is well written and covers not only the hashing algorithm
but also the wider architecture of how Google handles frontend traffic at scale.
The key insight is that Maglev uses a pre-computed lookup table of size M (some large prime number,
65537 in the paper) filled with backend indices. To handle a packet, the forwarder computes a
hash over the 5-tuple, takes it modulo M, looks up the table, and forwards to whatever backend is
stored there. No per-session state is needed, which avoids session matching and lots of RAM, and
the flow lookup can be done very efficiently.
### The Maglev new flow table
The interesting part is _how_ that lookup table is filled. A naive approach might be to divide the
M slots evenly among the N backends. That would work, but removing a backend would shift every
remaining backend's range, disrupting all flows and resetting TCP connections all over the place.
Maglev uses a smarter fill algorithm:
1. For each backend _i_, derive two independent hash values from its identity (typically its IP
address): an offset and a skip value. These define a *preference list* — a permutation of all M
slots that this backend would like to occupy, in preference order.
1. Iterate over all backends round-robin. Each backend claims its next preferred slot if it is
still free. Continue until every slot is filled.
The result is a table where each backend occupies approximately M/N slots, the distribution is
uniform, and most importantly, adding or removing one backend only displaces approximately 1/N of
the flows. All other flows keep hashing to the same backend. Slick!
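The two steps above can be sketched in Python, following the pseudocode in the NSDI'16 paper. The
two hash functions are stand-ins (VPP's implementation differs in detail), and `m` must be prime so
that each backend's preference list visits every slot exactly once:

```python
import hashlib

def _h(seed: str, data: str) -> int:
    # Stand-in for the paper's two independent hash functions.
    return int(hashlib.sha256((seed + data).encode()).hexdigest(), 16)

def maglev_table(backends: list[str], m: int = 65537) -> list[int]:
    """Fill an m-sized lookup table with backend indices, Maglev-style."""
    n = len(backends)
    # Each backend's preference list is the progression (offset + j*skip) mod m.
    # With m prime and skip in [1, m-1], it permutes all m slots.
    offset = [_h("offset", b) % m for b in backends]
    skip = [_h("skip", b) % (m - 1) + 1 for b in backends]
    next_idx = [0] * n     # position in each backend's preference list
    table = [-1] * m
    filled = 0
    while True:
        for i in range(n):
            # Backend i claims its next preferred slot that is still free.
            while True:
                slot = (offset[i] + next_idx[i] * skip[i]) % m
                next_idx[i] += 1
                if table[slot] == -1:
                    break
            table[slot] = i
            filled += 1
            if filled == m:
                return table
```

Because the backends claim slots one per round, each ends up with either ⌊M/N⌋ or ⌈M/N⌉ slots, so
the distribution is as uniform as it can be.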
### The Maglev existing flow hash table
Consistent hashing handles the common case well, but there is one subtlety: the hashing
guarantees that the *same 5-tuple always maps to the same backend*, but only as long as the set
of backends does not change. If a backend is added mid-stream, a fraction of existing TCP
connections will start hashing to a different backend.
To protect long-lived connections, Maglev keeps a small per-CPU *flow hash table*: an LRU cache of
recently seen 5-tuple to backend mappings. For every packet:
1. Look up in the Maglev flow hash table. On a hit, forward to the cached backend (even if the Maglev
table would now say something different).
1. On a miss, look up the Maglev new-flow table, select the backend, and insert the mapping into the
flow hash table.
The flow hash table does not need to be exhaustive — it only needs to cover *active* connections. An
LRU eviction policy handles the rest. This means the load balancer is *mostly* stateless, as the
Maglev table is deterministic and identical on every CPU, with just enough per-connection state
to protect existing TCP sessions from transient backend changes.
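The two-step lookup described above can be sketched as follows. This is a deliberate
simplification of my own: VPP's flow hash table is a purpose-built per-worker structure, not a
Python dict, but the hit/miss/evict logic is the same:

```python
from collections import OrderedDict

class FlowCache:
    """Connection-affinity cache in front of a stateless Maglev lookup."""
    def __init__(self, maglev_lookup, capacity=1024):
        self.lookup = maglev_lookup   # 5-tuple -> backend (stateless)
        self.capacity = capacity
        self.cache = OrderedDict()    # 5-tuple -> backend (LRU, stateful)

    def backend_for(self, five_tuple):
        if five_tuple in self.cache:
            # Hit: keep the cached backend, even if the Maglev table
            # has since changed under us.
            self.cache.move_to_end(five_tuple)
            return self.cache[five_tuple]
        # Miss: consult the Maglev table and remember the answer.
        backend = self.lookup(five_tuple)
        self.cache[five_tuple] = backend
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return backend
```

An existing flow keeps hitting its cached backend, while new flows pick up whatever the current
Maglev table says; once a flow falls out of the LRU, it reverts to the stateless path.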
### VPP LB: Plugin anatomy
The VPP load balancer plugin lives in `src/plugins/lb/`. Its core data structures map directly to
the Maglev design:
* ***VIP*** (Virtual IP): a prefix plus an optional {protocol, port} pair. This is the
public-facing address that clients connect to. A VIP can be protocol-agnostic and forward all
traffic to its backends, or it can be port-specific and forward, for example, only TCP/443 to its
backends.
* ***AS*** (Application Server): a backend endpoint associated with a VIP. The plugin
maintains a list of active ASes per VIP.
* ***New flow table***: the Maglev lookup table, computed from the active AS list whenever an
AS is added or removed. Size is configurable, defaulting to 1024 entries. It is filled by the
clever algorithm described above.
* ***Flow hash table***: per-worker LRU hash table of recent {5-tuple → AS} mappings. This is the
connection affinity cache described above.
* ***Encapsulation***: packets are forwarded to the AS by encapsulating them in either GRE
(GRE4 or GRE6), or via L3DSR (direct server return using DSCP remarking). The AS decapsulates and
responds directly to the client, bypassing the load balancer on the return path.
When a new flow arrives, VPP computes a hash over the 5-tuple modulo the length of its
new_flow_table and looks up the backend that will serve this client; it then stores that mapping in
the per-worker flow hash table and encapsulates the packet towards the AS. Subsequent packets for
the same 5-tuple hit the flow hash table directly, skipping the Maglev lookup entirely.
A garbage collection timer periodically walks the flow table and removes entries for backends that
have become inactive, preventing stale flows from reaching a long-gone AS. Operators can also
remove an AS and flush existing connections to it.
#### Observations
After reading the LB code in VPP, I am ready to make a few observations.
***1: Lameduck*** My first choice is 'remove AS from VIP': by removing the AS from the Maglev
new-flow table, it will not get new flows assigned, but if there are long-lived clients, the server
may keep connections open indefinitely. A good example is a websocket that streams data between
a client and the webserver: it never disconnects!
My other choice is to 'Remove and flush AS from VIP', which will also remove it from being eligible
for new flows, but forcibly remove all existing flows from the flow hash table at the same time.
Yikes.
I want a middle ground, operationally:
1. Remove AS from VIP for _new connections_ while keeping existing ones for a grace period. This
is commonly referred to as _lameduck_ mode.
1. Remove AS from VIP for _all connections_, which will reset any lingering connections and move
them to another backend where they reconnect and continue on their journey.
***2: Slow undrain***: From my own experience, adding a new AS often needs to be done carefully,
for two reasons. First, sloshing traffic around can overwhelm a new or freshly started server that
does lazy initialization (for example, a Java binary). Second, a new server may have a different
configuration on purpose, for example a different version of the server binary, or different
parameters like caching flags and what-not. It may be good to ease in traffic and inspect it for a
little while before bringing full load onto the server. This is commonly referred to as a _canary_
backend. I'll come back to this later.
### VPP LB: Bugs
While playing around with the plugin's binary API, I ran into a collection of bugs that made the
plugin largely unusable via the API (as opposed to the CLI). I fixed those in Gerrit
[[45428](https://gerrit.fd.io/r/c/vpp/+/45428)].
* ***IPv4 VIP prefixlen offset bug***: `lb_add_del_vip()` was computing the prefix length
incorrectly for IPv4 addresses due to an off-by-one in the address family handling, producing
VIPs that silently matched no traffic.
* ***Wrong encap type on VIP create***: Both `lb_add_del_vip()` and `lb_add_del_vip_v2()`
were passing the encapsulation type through an incorrect enum mapping, so a VIP created with GRE4
encap via the API would actually end up configured with a different encap type internally.
* ***lb_vip_dump() returning wrong fields***: The dump handler was returning a stale encap
type and an incorrect protocol value, making it impossible to verify what was actually configured
via the API.
* ***lb_as_dump() port filter broken***: The AS dump call accepts an optional VIP filter. The
port comparison was being done against an uninitialized variable, causing the filter to miss
entries or match wrong ones depending on stack contents.
* ***Missing lb_conf_get()***: There was no API call to retrieve the global LB configuration
(flow table size, timeout values). I added `lb_conf_get()` so an operator or controlplane can verify
the running configuration without resorting to CLI parsing.
* ***'show lb vips' unformatting error***: The CLI handler dereferenced a pointer that
is only valid in verbose mode, causing unexpected output (and a possible crash!) on a plain `show lb
vips`.
* ***GC only triggered by CLI input***: The garbage collector for the flow table was only
invoked when the operator typed a CLI command. On a production load balancer, stale flow entries
would accumulate indefinitely. So I added a periodic GC timer that automatically cleans up the flow
hash table.
While discussing on the `vpp-dev` mailing list, my buddy Jerome Tollet independently found two of
these bugs (the encap type mismatch and the dump port filter) and reported them during review. Both
are addressed in the latest patchset.
### VPP LB: New Feature - Weights
My attempt to address the two observations above comes from the insight that they are actually the
same class of problem: I want to be able to send a given backend anywhere from 100% all the way
down to 0% of the load it would normally receive, and I want to be able to flush (remove existing
flows from the flow hash table) independently of the new-flow assignment.
This is commonly referred to as _weights_ in a load balancer, and in Gerrit
[[45487](https://gerrit.fd.io/r/c/vpp/+/45487)] I add per-AS weights to the Maglev new flow table,
and decouple 'flush' from 'set weight' semantically.
The motivation comes from the two operational scenarios I kept running into while testing the plugin:
**1. Draining a backend without disrupting existing sessions.** When a backend needs to go down for
maintenance, the only option was `lb as del flush`, which both removes the AS *and* flushes the
flow table. Flushing the flow table is disruptive: all existing TCP sessions that were pinned to
any backend suddenly need to re-select, causing a brief spike of misdirected packets. What I
actually want is to stop sending *new* flows to the AS while letting existing sessions drain
naturally.
**2. Introducing a new backend gradually.** When adding a new AS to a busy VIP, the Maglev algorithm
immediately assigns it ~1/N of the new-flow table slots. On a VIP handling tens of thousands of
new connections per second, that is a lot of traffic hitting a backend that may not yet be fully
warmed up (think JVM JIT, filled caches, established database connections). It would be useful to
introduce the new AS slowly and ramp it up over time.
My solution for both is to allow each AS to carry a weight in the range 0-100, which controls what
fraction of the new flow table slots it is allowed to occupy:
* ***weight 100*** (default): the AS gets its full ~1/N share of slots. This is the existing
behavior, and remains the default.
* ***weight 1-99***: the AS gets a proportionally smaller share. Useful for gradual introduction
as well as gradual removal.
* ***weight 0***: the AS gets no slots in the new flow table — no new flows are sent to it. The
flow table entries for existing sessions remain intact, so those connections keep working until
they naturally expire.
The Maglev fill algorithm is made weight-aware by scaling each AS's preference list length
proportionally to its weight. The sort order is deterministic (sorted by `(replica, address)`)
so the resulting table is identical regardless of the order ASes were added, which also has a bonus
side effect of making anycast and ECMP VIPs work correctly.
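To illustrate how weights can shape the table, here is a sketch of my own, not the code in the
Gerrit change: it turns the weights into a per-backend slot budget (largest-remainder
apportionment) and has the Maglev-style claim loop respect that budget. The hash functions are
again arbitrary stand-ins:

```python
import hashlib

def _h(seed: str, data: str) -> int:
    return int(hashlib.sha256((seed + data).encode()).hexdigest(), 16)

def weighted_maglev_table(backends, weights, m=65537):
    """Maglev-style fill where each backend's slot share scales with its
    weight; weight 0 removes it from new-flow assignment entirely."""
    active = [i for i, w in enumerate(weights) if w > 0]
    if not active:
        raise ValueError("at least one backend must have weight > 0")
    total = sum(weights[i] for i in active)
    # Largest-remainder apportionment of the m slots by weight.
    exact = {i: m * weights[i] / total for i in active}
    budget = {i: int(exact[i]) for i in active}
    leftover = m - sum(budget.values())
    for i in sorted(active, key=lambda i: exact[i] - budget[i],
                    reverse=True)[:leftover]:
        budget[i] += 1
    offset = {i: _h("offset", backends[i]) % m for i in active}
    skip = {i: _h("skip", backends[i]) % (m - 1) + 1 for i in active}
    nxt = {i: 0 for i in active}
    table = [-1] * m
    filled = 0
    while filled < m:
        for i in active:
            if budget[i] == 0:
                continue  # this backend has used up its share
            while True:
                slot = (offset[i] + nxt[i] * skip[i]) % m
                nxt[i] += 1
                if table[slot] == -1:
                    break
            table[slot] = i
            budget[i] -= 1
            filled += 1
            if filled == m:
                break
    return table
```

A backend at weight 50 ends up with roughly half the slots of a weight-100 peer, and setting a
weight to 0 simply leaves it out of the table while existing flows (pinned in the flow hash table)
continue undisturbed.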
Because VPP developers do not change API signatures once they are published, I added a few new API
calls instead:
* `lb_as_add_del_v2()` — creates or deletes an AS with an explicit weight, and optionally
flushes the flow table for that AS on deletion.
* `lb_as_dump_v2()` — returns the weight and the number of new-flow-table buckets currently
assigned to each AS, which is useful for verifying the distribution.
* `lb_as_set_weight()` — changes the weight of an existing AS in place, optionally flushing
the flow table, without needing to delete and recreate the AS.
From the CLI, the weight is set with:
```
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 1
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 10
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 100
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 flush
```
In the sequence above, backend AS `10.0.0.1` starts off fully drained, then gets a token amount
of traffic at weight 1, then more at weight 10, and finally its full share at weight 100. When the
backend needs to be removed, I can set `weight 0`, which puts it in _lameduck_ mode but keeps
existing flows alive. A few minutes later, I can set `weight 0 flush`, which removes the remaining
existing flows. The backend can then be safely removed, without having to wait 5+ hours like I did
with the uncontrolled DNS 'drain'.
### VPP LB: New Feature - Punt Unknown
I'm still on the fence on this feature, but since I wrote it .. Gerrit
[[45431](https://gerrit.fd.io/r/c/vpp/+/45431)] adds a `punt` flag to port-based VIPs.
By default, when a VIP is configured with a specific protocol and port (e.g. TCP/443), any packet
that arrives at that VIP's address but does *not* match the configured {protocol, port} pair is
sent by VPP to `error-drop`. This is the correct behavior for most cases: if I am load balancing
TCP/443, I do not want stray UDP packets forwarded anywhere.
The problem is that this also drops ICMP. If an operator runs `traceroute` towards the VIP, or
sends an ICMP echo, or a client receives an ICMP unreachable, all of that is silently discarded.
This makes the VIP opaque from the network's perspective and can complicate debugging.
When creating a port-based VIP, I decide to add a `punt` flag: any traffic that does not match
the configured protocol/port pairs on the VIP will be punted to the local IP stack
(`ip4-local` or `ip6-local`) instead of dropped. To make this work, I ask VPP to insert the VIP's
address into the FIB at a higher priority than device routes, so the punt path is actually
reachable. This allows the load balancer to handle TCP/443 (or whatever protocol/port combinations
are configured) while the local stack takes care of ICMP, traceroute, and anything else that
arrives at that address and is not part of the Maglev configuration.
The `punt` flag is only permitted on port-based VIPs — on a protocol-agnostic VIP there is
nothing left to punt, since all traffic is already matched and forwarded to application servers.
Enabling this from the CLI is straightforward, at creation time:
```
vpp# loopback create interface instance 0
vpp# lcp create loop0 host-if maglev0
vpp# set int state loop0 up
vpp# set int ip address loop0 192.0.2.0/32
vpp# lb vip 192.0.2.0/32 protocol tcp port 443 encap gre4 punt
```
In this configuration snippet, I first create a simple loopback device with a given IPv4 address,
and plumb it through to Linux using the [[Linux CP]({{< ref 2021-08-12-vpp-1 >}})] plugin. This makes
it reachable: I can ping it and traceroute to it just like any other _LIP_ (Linux Interface Pair).
Then, I _steal_ some traffic from it, by creating an LB VIP on this address. Without this feature,
the VIP would become unreachable, as the LB plugin would take all traffic destined to the IPv4
address. But with the `punt` keyword, any traffic not matching the LB VIP(s) on this address, will
be sent onwards to the IP stack and end up in Linux. For those of us who like pinging their VIPs,
the `punt` feature flag on VIPs will come in handy.
For the same reason as with the other feature I wrote, I need to add new API calls rather than
changing existing ones, so here I go:
* `lb_add_del_vip_v3()` — adds a `is_punt` flag to the VIP creation call.
* `lb_vip_dump_v2()` — returns `is_punt` in the VIP details, so an operator or controlplane can
verify the configuration.
## What's Next
I am going to use Maglev at IPng Networks to load balance our services like SMTP, IMAP, HTTP, DNS and
what-not. But before I can do that, I'm going to want to write some sort of controlplane that can
manipulate the VIPs, AS weights, and do things like health checking. I'm inspired by
[[HAProxy](https://haproxy.org/)] which I used to use way back when. I find its health checking
algorithm particularly clever, so I will give that codebase a good read and with what I learn,
create a health checking VPP Maglev controlplane which will give me much better insight into what
traffic goes where.
Stay tuned!