---
date: "2026-04-30T06:35:14Z"
title: VPP with Maglev Loadbalancing - Part 1
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar, as many of the approaches
are shared between the two.

Load balancing is one of those topics that sounds deceptively simple until you think about it for a
while. In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs,
and add two small new features that make running it in production a little bit easier.

## Introduction

IPng runs services that want to be reachable via as few public IP addresses as possible. Let's say I
want to run a DNS resolver or authoritative nameserver, or even the IPng website, but I want these to
be highly available and perhaps scale to more traffic than one backend server could provide. What
are my options?

My first option is to simply put a bunch of servers online, say 7 webservers, give each of them an
A/AAAA record in DNS, and point `ipng.ch` at all of them. It's clumsy: if one server is down for
maintenance or due to a failure, one seventh of the traffic will still try to reach it. Removing a
server from DNS also leaves lots of lingering traffic on that webserver, as clients are sometimes
slow to pick up the DNS changes, even if my TTL is low.

Let me show you an example:

{{< image width="100%" src="/assets/vpp-maglev/qps-before.png" alt="NGINX qps per instance" >}}

There are two main problems with this graph:

1. ***Load imbalance***: there are seven webservers in this graph, but somehow only three of them
are getting traffic. One (`nginx0.chrma0`) is much more heavily loaded than the others: it is
receiving 1.2kqps while the others are receiving ~40qps. This poses a risk: if the clients that are
somehow attracted to this instance grow in number, they may overwhelm this little webserver, even
though there are six others that could help out!

2. ***Drains take _forever_***: The green graph was a drain of `nginx0.nlams2` due to a pending
maintenance window, as the datacenter is closing and the server needs to be physically moved. I put
in the DNS change at around 16:15 UTC and the traffic finally dropped at 21:45, a full five hours
(!) later. And believe it or not, the TTL was 15 minutes on these records. Some clients just don't
get the hint ...

### Load balancing 101

A naive load balancing solution is to simply round-robin: send each new packet to the next backend
in the list. That works reasonably well for stateless UDP traffic like DNS, although even with DNS
there is a gotcha: some DNS queries need TCP, for example those whose answers are too big to fit in
a single UDP packet, and those will not tolerate naive packet round-robin. For TCP this naive load
balancing solution quickly falls apart, because every packet in a connection needs to reach the
*same* backend. Sending a SYN to backend A and the subsequent ACK to backend B will not establish a
TCP connection.

The classical answer is to keep per-session state on the load balancer: a table that maps the
5-tuple of {source IP, destination IP, source port, destination port, protocol} to a chosen backend.
That works, but it introduces a stateful bottleneck. On a load balancer handling millions of flows
and packets per second at line rate, maintaining and synchronising that table across multiple CPU
threads is expensive. It also means that if the load balancer restarts, every existing TCP session
breaks.

What if there was some form of *consistent hashing*: given the 5-tuple of a packet, the load
balancer might always select the same backend deterministically, without storing any per-session
state. If backends come and go, only the flows that were assigned to the changed backend are
affected, and all other flows keep working. Google solved this problem at scale and published their
solution. They call it Maglev.

## Introducing Maglev

{{< image width="12em" float="right" src="/assets/vpp-maglev/maglev.png" alt="Icon of a maglev train" >}}

Google's Maglev load balancer has been running in production since 2008, and I happen to know several
of its authors. As a personal aside: I was sad to learn that Cody Smith, with whom I shared an office
and a team for many years, passed away earlier this year. Rest in peace, Cody!

The Google team published their design at NSDI 2016 in the paper
[[Maglev: A Fast and Reliable Software Network Load Balancer](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].
It is worth reading in full: the paper is well written and covers not only the hashing algorithm
but also the wider architecture of how Google handles frontend traffic at scale.

The key insight is that Maglev uses a pre-computed lookup table of size M (some large prime number,
65537 in the paper) filled with backend indices. To handle a packet, the forwarder computes a
hash over the 5-tuple modulo M, looks up the table, and forwards to whatever backend is stored
there. No per-session state is needed, which avoids both session matching and lots of RAM, and
the flow lookup can be done super efficiently.
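
In other words, backend selection boils down to a single array lookup per packet. A minimal sketch
in Python, where `table` is the pre-computed Maglev table and Python's built-in `hash()` stands in
for the real 5-tuple hash:

```
def select_backend(table: list[int], five_tuple: tuple) -> int:
    # One hash and one array lookup per packet; no per-flow state needed.
    return table[hash(five_tuple) % len(table)]
```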

### The Maglev new flow table

The interesting part is _how_ that lookup table is filled. A simple approach might be to divide the
M slots evenly among the N backends. That would work, but removing a backend would shift every
remaining backend's range, disrupting all flows and resetting TCP connections all over the place.
Maglev uses a smarter fill algorithm, shown in the sketch after this list:

1. For each backend _i_, derive two independent hash values from its identity (typically its IP
address): an offset and a skip value. These define a *preference list*, a permutation of all M
slots that this backend would like to occupy, in preference order.
1. Iterate over all backends round-robin. Each backend claims its next preferred slot if it is
still free. Continue until every slot is filled.

The result is a table where each backend occupies approximately M/N slots, the distribution is
uniform, and most importantly, adding or removing one backend only displaces approximately 1/N of
the flows. All other flows keep hashing to the same backend. Slick!
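
Here is a sketch of that fill algorithm in Python, following the pseudocode from the paper. The
`_h()` helper and its salts are my stand-ins for the two independent hash functions; M should be
prime so that every skip value generates a full permutation of the table:

```
import hashlib

def _h(key: str, salt: str, mod: int) -> int:
    # Two independent hashes derived from the backend identity; the paper
    # uses two different hash functions, a salted digest will do for a sketch.
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod

def populate(backends: list[str], M: int) -> list[int]:
    """Fill the size-M Maglev lookup table with backend indices."""
    offset = [_h(b, "offset", M) for b in backends]
    skip = [_h(b, "skip", M - 1) + 1 for b in backends]  # skip in [1, M-1]
    next_pref = [0] * len(backends)  # position in each backend's preference list
    table = [-1] * M
    filled = 0
    while True:
        for i in range(len(backends)):
            # Walk backend i's preference list until a free slot is found.
            while True:
                slot = (offset[i] + next_pref[i] * skip[i]) % M
                next_pref[i] += 1
                if table[slot] < 0:
                    break
            table[slot] = i
            filled += 1
            if filled == M:
                return table
```

A quick way to see the consistency property is to populate the table twice, once with and once
without a backend, and count how many slots changed owner:

```
b3 = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
b2 = ["10.0.0.1", "10.0.0.3"]
M = 65537  # a prime, as in the paper
t3 = [b3[i] for i in populate(b3, M)]
t2 = [b2[i] for i in populate(b2, M)]
moved = sum(x != y for x, y in zip(t3, t2))
# Roughly the third of the slots owned by 10.0.0.2 move; almost all others stay put.
```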

### The Maglev existing flow hash table

Consistent hashing handles the common case well, but there is one subtlety: the hashing
guarantees that the *same 5-tuple always maps to the same backend*, but only as long as the set
of backends does not change. If a backend is added mid-stream, a fraction of existing TCP
connections will start hashing to a different backend.

To protect long-lived connections, Maglev keeps a small per-CPU *flow hash table*: an LRU cache of
recently seen 5-tuple to backend mappings. For every packet:

1. Look up in the Maglev flow hash table. On a hit, forward to the cached backend (even if the Maglev
table would now say something different).
1. On a miss, look up the Maglev new-flow table, select the backend, and insert the mapping into the
flow hash table.

The flow hash table does not need to be exhaustive: it only needs to cover *active* connections. An
LRU eviction policy handles the rest. This means the load balancer is *mostly* stateless, as the
Maglev table is deterministic and identical on every CPU, with just enough per-connection state
to protect existing TCP sessions from transient backend changes.
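
Continuing the Python sketch from above, the two-level lookup might look like this. An
`OrderedDict` plays the role of the per-CPU LRU cache, and the cache size is illustrative:

```
from collections import OrderedDict

class Forwarder:
    """Per-worker forwarding state: the Maglev table plus an LRU flow cache."""

    def __init__(self, table: list[int], max_flows: int = 1 << 16):
        self.table = table                       # the Maglev new-flow table
        self.max_flows = max_flows
        self.flows: OrderedDict = OrderedDict()  # LRU cache: 5-tuple -> backend

    def select(self, five_tuple) -> int:
        backend = self.flows.get(five_tuple)
        if backend is not None:
            self.flows.move_to_end(five_tuple)   # refresh LRU position
            return backend                       # hit: sticky even if the table changed
        # Miss: consult the Maglev table, then remember the mapping.
        backend = self.table[hash(five_tuple) % len(self.table)]
        if len(self.flows) >= self.max_flows:
            self.flows.popitem(last=False)       # evict the least recently used flow
        self.flows[five_tuple] = backend
        return backend
```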

### VPP LB: Plugin anatomy

The VPP load balancer plugin lives in `src/plugins/lb/`. Its core data structures map directly to
the Maglev design:

* ***VIP*** (Virtual IP): a prefix plus an optional {protocol, port} pair. This is the
public-facing address that clients connect to. A VIP can be protocol-agnostic and forward all
traffic to its backends, or it can be port-specific and forward, for example, only TCP/443 to its
backends.
* ***AS*** (Application Server): a backend endpoint associated with a VIP. The plugin
maintains a list of active ASes per VIP.
* ***New flow table***: the Maglev lookup table, computed from the active AS list whenever an
AS is added or removed. Its size is configurable and defaults to 1024 entries. It is filled by the
clever algorithm described above.
* ***Flow hash table***: a per-worker LRU hash table of recent {5-tuple → AS} mappings. This is the
connection affinity cache described above.
* ***Encapsulation***: packets are forwarded to the AS by encapsulating them in either GRE
(GRE4 or GRE6), or via L3DSR (direct server return using DSCP remarking). The AS decapsulates and
responds directly to the client, bypassing the load balancer on the return path.

When a new flow arrives, VPP computes a hash over the 5-tuple modulo the length of its
new_flow_table, looks up the backend that will serve this client, stores it in the
per-worker flow hash table, and encapsulates the packet towards the AS. Subsequent packets for the
same 5-tuple hit the flow hash table directly, skipping the Maglev lookup entirely.

A garbage collection timer periodically walks the flow table and removes entries for backends that
have become inactive, preventing stale flows from reaching a long-gone AS. Operators can also remove
these ASes, and flush existing connections to them.
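
To make the anatomy concrete, a minimal setup might look like the following from the CLI. This is a
sketch: the addresses are examples, and `lb conf`, `lb vip` and `lb as` take more options than shown
here:

```
vpp# lb conf ip4-src-address 192.0.2.1
vpp# lb vip 192.0.2.10/32 protocol tcp port 443 encap gre4
vpp# lb as 192.0.2.10/32 protocol tcp port 443 10.0.0.10 10.0.0.11
vpp# show lb vips verbose
```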

#### Observations

After reading the LB code in VPP, I am ready to make a few observations.

***1: Lameduck*** I have the choice of 'remove AS from VIP': by removing the AS from the Maglev
new-flow table, it will not get new flows assigned, but if there are long-lived clients, the server
will keep connections open potentially indefinitely. A good example is a websocket that streams data
between a client and the webserver: it never disconnects!

My other choice is to 'remove and flush AS from VIP', which will also remove it from being eligible
for new flows, but forcibly removes all existing flows from the flow hash table at the same time.
Yikes.

I want a middle ground, operationally:
1. Remove the AS from the VIP for _new connections_ while keeping existing ones for a grace period.
This is commonly referred to as _lameduck_ mode.
1. Remove the AS from the VIP for _all connections_, which will reset any lingering connections and
move them to another backend, where they reconnect and continue on their journey.

***2: Slow undrain***: From my own experience, adding a new AS often needs to be done carefully, for
two reasons. First, sloshing traffic around can overwhelm a new or freshly started server which does
lazy initialization (for example, a Java binary). Second, a new server may have a different
configuration on purpose, for example a different version of the server binary, or different
parameters like caching flags and what-not. It may be good to ease in traffic and inspect it for a
little while before bringing full load onto the server. This is commonly referred to as a _canary_
backend. I'll come back to this later.

### VPP LB: Bugs

While playing around with the plugin's binary API, I ran into a collection of bugs that made the
plugin largely unusable via the API (as opposed to the CLI). I fixed those in Gerrit
[[45428](https://gerrit.fd.io/r/c/vpp/+/45428)].

* ***IPv4 VIP prefixlen offset bug***: `lb_add_del_vip()` was computing the prefix length
incorrectly for IPv4 addresses due to an off-by-one in the address family handling, producing
VIPs that silently matched no traffic.

* ***Wrong encap type on VIP create***: Both `lb_add_del_vip()` and `lb_add_del_vip_v2()`
were passing the encapsulation type through an incorrect enum mapping, so a VIP created with GRE4
encap via the API would actually end up configured with a different encap type internally.

* ***lb_vip_dump() returning wrong fields***: The dump handler was returning a stale encap
type and an incorrect protocol value, making it impossible to verify what was actually configured
via the API.

* ***lb_as_dump() port filter broken***: The AS dump call accepts an optional VIP filter. The
port comparison was being done against an uninitialized variable, causing the filter to miss
entries or match wrong ones depending on stack contents.

* ***Missing lb_conf_get()***: There was no API call to retrieve the global LB configuration
(flow table size, timeout values). I added `lb_conf_get()` so an operator or controlplane can verify
the running configuration without resorting to CLI parsing.

* ***'show lb vips' unformatting error***: The CLI handler dereferenced a pointer that
is only valid in verbose mode, causing unexpected output (and a possible crash!) on a plain `show lb
vips`.

* ***GC only triggered by CLI input***: The garbage collector for the flow table was only
invoked when the operator typed a CLI command. On a production load balancer, stale flow entries
would accumulate indefinitely. So I added a periodic GC timer that automatically cleans up the flow
hash table.

While discussing this on the `vpp-dev` mailing list, my buddy Jerome Tollet independently found two
of these bugs (the encap type mismatch and the dump port filter) and reported them during review.
Both are addressed in the latest patchset.

### VPP LB: New Feature - Weights

My attempt to address the two observations above comes from the insight that they are actually the
same class of problem: I want to be able to send a given backend a variable amount of traffic,
anywhere from 100% all the way down to 0% of the load it is capable of handling, and I want to be
able to flush (remove existing flows from the flow hash table) independently of the new-flow
assignment. This is commonly referred to as _weights_ in a load balancer, and in Gerrit
[[45487](https://gerrit.fd.io/r/c/vpp/+/45487)] I add per-AS weights to the Maglev new flow table,
and decouple 'flush' from 'set weight' semantically.

The motivation comes from the two operational scenarios I kept running into while testing the plugin:

**1. Draining a backend without disrupting existing sessions.** When a backend needs to go down for
maintenance, the only option was `lb as del flush`, which both removes the AS *and* flushes the
flow table. Flushing the flow table is disruptive: all existing TCP sessions that were pinned to
any backend suddenly need to re-select, causing a brief spike of misdirected packets. What I
actually want is to stop sending *new* flows to the AS while letting existing sessions drain
naturally.

**2. Introducing a new backend gradually.** When adding a new AS to a busy VIP, the Maglev algorithm
immediately assigns it ~1/N of the new-flow table slots. On a VIP handling tens of thousands of
new connections per second, that is a lot of traffic hitting a backend that may not yet be fully
warmed up (think JVM JIT, filled caches, established database connections). It would be useful to
introduce the new AS slowly and ramp it up over time.

My solution for both is to allow each AS to carry a weight in the range 0–100, which controls what
fraction of the new flow table slots it is allowed to occupy:

* ***weight 100*** (default): the AS gets its full ~1/N share of slots. This is the existing
behavior, and remains the default.
* ***weight 1–99***: the AS gets a proportionally smaller share. Useful for gradual introduction
as well as gradual removal.
* ***weight 0***: the AS gets no slots in the new flow table, so no new flows are sent to it. The
flow table entries for existing sessions remain intact, so those connections keep working until
they naturally expire.

The Maglev fill algorithm is made weight-aware by scaling each AS's preference list length
proportionally to its weight. The sort order is deterministic (sorted by `(replica, address)`),
so the resulting table is identical regardless of the order in which ASes were added, which also
has the bonus side effect of making anycast and ECMP VIPs work correctly.
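
To illustrate one way a weight-aware fill can work, here is a sketch that extends the `populate()`
example from earlier by capping each backend's share of slots at its weighted fraction (it reuses
the `_h()` helper). This quota approach is my illustration of the idea, not the code from the
Gerrit patch:

```
import math

def populate_weighted(backends: list[str], weights: list[int], M: int) -> list[int]:
    """Weight-aware Maglev fill: backend i claims at most ~M * w_i / sum(w) slots."""
    total = sum(weights)
    if total == 0:
        return [-1] * M  # all backends drained: no slots assigned
    quota = [math.ceil(M * w / total) for w in weights]
    offset = [_h(b, "offset", M) for b in backends]
    skip = [_h(b, "skip", M - 1) + 1 for b in backends]
    next_pref = [0] * len(backends)
    table = [-1] * M
    filled = 0
    while filled < M:
        for i in range(len(backends)):
            if quota[i] == 0:
                continue  # weight 0, or share already claimed: skip this backend
            while True:
                slot = (offset[i] + next_pref[i] * skip[i]) % M
                next_pref[i] += 1
                if table[slot] < 0:
                    break
            table[slot] = i
            quota[i] -= 1
            filled += 1
            if filled == M:
                break
    return table
```

A weight-0 backend gets a quota of zero and therefore never claims a slot, which is exactly the
lameduck behavior described above: no new flows, while cached flow-table entries keep working.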

Because VPP developers do not change API signatures once they are published, I added a few new API
calls instead:

* `lb_as_add_del_v2()`: creates or deletes an AS with an explicit weight, and optionally
flushes the flow table for that AS on deletion.
* `lb_as_dump_v2()`: returns the weight and the number of new-flow-table buckets currently
assigned to each AS, which is useful for verifying the distribution.
* `lb_as_set_weight()`: changes the weight of an existing AS in place, optionally flushing
the flow table, without needing to delete and recreate the AS.
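
From Python, these calls can be driven with `vpp_papi`. A rough sketch; the argument names below
are my assumptions based on the call names, so check the `lb.api` definitions in the patchset for
the actual message layout:

```
from vpp_papi import VPPApiClient

# Connect to the local VPP instance over its binary API.
vpp = VPPApiClient()
vpp.connect("lb-weight-demo")

# Hypothetical field names: consult lb.api for the real ones.
vpp.api.lb_as_set_weight(pfx="192.0.2.0/32", as_address="10.0.0.1", weight=10)

vpp.disconnect()
```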

From the CLI, the weight is set with:

```
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 1
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 10
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 100
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 flush
```

In the sequence above, backend AS `10.0.0.1` starts off fully drained, then gets a token amount
of traffic by setting it to weight 1, then 10, and finally 100. When the backend needs to be
removed, I can set `weight 0`, which will put it in _lameduck_ mode but keep existing flows alive.
A few minutes later, I can set it to `weight 0 flush`, which will remove the remaining existing
flows. The backend can then be safely removed, without having to wait 5+ hours like I did with the
uncontrolled DNS 'drain'.

### VPP LB: New Feature - Punt Unknown

I'm still on the fence about this feature, but since I wrote it .. Gerrit
[[45431](https://gerrit.fd.io/r/c/vpp/+/45431)] adds a `punt` flag to port-based VIPs.

By default, when a VIP is configured with a specific protocol and port (e.g. TCP/443), any packet
that arrives at that VIP's address but does *not* match the configured {protocol, port} pair is
sent by VPP to `error-drop`. This is the correct behavior for most cases: if I am load balancing
TCP/443, I do not want stray UDP packets forwarded anywhere.

The problem is that this also drops ICMP. If an operator runs `traceroute` towards the VIP, or
sends an ICMP echo, or a client receives an ICMP unreachable, all of that is silently discarded.
This makes the VIP opaque from the network's perspective and can complicate debugging.

When creating a port-based VIP, I decided to add a `punt` flag, so that any traffic that does not
match the configured protocol/port pairs on the VIP is punted to the local IP stack
(`ip4-local` or `ip6-local`) instead of dropped. To make this work, I ask VPP to insert the VIP's
address into the FIB at a higher priority than device routes, so the punt path is actually
reachable. This allows the load balancer to handle TCP/443 (or whatever protocol/port combinations
are configured) while the local stack takes care of ICMP, traceroute, and anything else that arrives
at that address and is not part of the Maglev configuration.

The `punt` flag is only permitted on port-based VIPs: on a protocol-agnostic VIP there is
nothing left to punt, since all traffic is already matched and forwarded to application servers.

Enabling this from the CLI is straightforward, at creation time:

```
vpp# loopback create interface instance 0
vpp# lcp create loop0 host-if maglev0
vpp# set int state loop0 up
vpp# set int ip address loop0 192.0.2.0/32
vpp# lb vip 192.0.2.0/32 protocol tcp port 443 encap gre4 punt
```

In this configuration snippet, I first create a simple loopback device with a given IPv4 address,
and plumb it through to Linux using the [[Linux CP]({{< ref 2021-08-12-vpp-1 >}})] plugin. This makes
it reachable: I can ping it and traceroute to it just like any other Linux Interface Pair (_LIP_).
Then, I _steal_ some traffic from it, by creating an LB VIP on this address. Without this feature,
the VIP would become unreachable, as the LB plugin would take all traffic destined to the IPv4
address. But with the `punt` keyword, any traffic not matching the LB VIP(s) on this address will
be sent onwards to the IP stack and end up in Linux. For those of us who like pinging their VIPs,
the `punt` feature flag on VIPs will come in handy.

For the same reason as with the other feature I wrote, I need to add new API calls rather than
changing existing ones, so here I go:

* `lb_add_del_vip_v3()`: adds an `is_punt` flag to the VIP creation call.
* `lb_vip_dump_v2()`: returns `is_punt` in the VIP details, so an operator or controlplane can
verify the configuration.

## What's Next

I am going to use Maglev at IPng Networks to load balance our services like SMTP, IMAP, HTTP, DNS and
what-not. But before I can do that, I'm going to want to write some sort of controlplane that can
manipulate the VIPs and AS weights, and do things like health checking. I'm inspired by
[[HAProxy](https://haproxy.org/)], which I used way back when. I find its health checking
algorithm particularly clever, so I will give that codebase a good read, and with what I learn,
create a health checking VPP Maglev controlplane which will give me much better insight into what
traffic goes where.

Stay tuned!