diff --git a/content/articles/2026-04-30-vpp-maglev.md b/content/articles/2026-04-30-vpp-maglev.md new file mode 100644 index 0000000..a65c025 --- /dev/null +++ b/content/articles/2026-04-30-vpp-maglev.md @@ -0,0 +1,352 @@ +--- +date: "2026-04-30T06:35:14Z" +title: VPP with Maglev Loadbalancing - Part 1 +--- +

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches +are shared between the two.

Load balancing is one of those topics that sounds deceptively simple until you think about it for a +while. In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs, +and add two small new features that make running it in production a little bit easier.

## Introduction

IPng runs services that want to be reachable via as few public IP addresses as possible. Let's say I +want to run a DNS resolver or authoritative nameserver or even the IPng website, but I want these to +be highly available and perhaps scale to more traffic than one backend server could provide. What +are my options?

My first option is to simply put a bunch of servers online, say 7 webservers, give each of them an A/AAAA record in DNS, and point `ipng.ch` at them. It's clumsy: if one server is down for maintenance or due to a failure, one seventh of the traffic will still try to reach it. Also, removing a server leaves lots of lingering traffic on that webserver, as clients are sometimes slow to pick up DNS changes, even if my TTL is low.

Let me show you an example:

{{< image width="100%" src="/assets/vpp-maglev/qps-before.png" alt="NGINX qps per instance" >}}

There are two main problems with this graph:

1. ***Load imbalance***: there are seven webservers in this graph, but somehow only three of them +are getting traffic; the others are not. One is much more heavily loaded (`nginx0.chrma0`) than the +others. It's receiving 1.2kqps while others are receiving ~40qps. This poses a risk: if the clients that are somehow attracted to this instance grow, they may overwhelm this little webserver, even though there are six others that could help out!

2. ***Drains take _forever_***: The green graph was a drain of `nginx0.nlams2` due to a pending +maintenance window as the datacenter is closing and the server needs to be physically moved. I put +in the DNS change at around 16:15 UTC and the traffic finally dropped at 21:45, a full five hours +(!) later. And believe it or not, the TTL was 15 minutes on these records. Some clients just don't +get the hint ...

### Load balancing 101

A naive load balancing solution is to simply round-robin: send each new packet to the next backend in the list. +That works reasonably well for stateless UDP traffic like DNS, although even with DNS there is a gotcha: some +DNS queries need TCP, for example those that are too big to fit in a single UDP packet, and they +will not be tolerant of naive packet round robin. For TCP this naive load balancing solution +quickly falls apart, because every packet in a connection needs to reach the *same* backend. Sending +a SYN to backend A and the subsequent ACK to backend B will not establish a TCP connection.
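To make the failure mode concrete, here is a tiny Python sketch of what per-packet round robin does to a single TCP connection. The addresses, ports and backend names are made up, purely for illustration:

```
from itertools import cycle

backends = cycle(["as-0", "as-1", "as-2"])

# One TCP connection: every packet carries the same 5-tuple of
# {source IP, destination IP, source port, destination port, protocol}.
src_ip, dst_ip, src_port, dst_port = "198.51.100.7", "192.0.2.1", 51234, 443
packets = ["SYN", "ACK", "GET /", "FIN"]

# Naive per-packet round robin: consecutive packets of the *same*
# connection land on different backends, so no backend ever sees a
# complete handshake, let alone the request that follows it.
for pkt in packets:
    print(f"{pkt:6s} {src_ip}:{src_port} -> {dst_ip}:{dst_port} => {next(backends)}")
```

Every backend sees a fragment of the conversation, and none of them can complete it.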
The classical answer is to keep per-session state on the load balancer: a table that maps a 5-tuple of +{source IP, destination IP, source port, destination port, protocol} to a chosen backend. That +works, but it introduces a stateful bottleneck. At line rate on a load balancer handling millions of +flows and packets/sec, maintaining and synchronising that table across multiple CPU threads is expensive. It also means that if the +load balancer restarts, every existing TCP session breaks.

What if there were some form of *consistent hashing*: given the 5-tuple of a packet, the load balancer +would always select the same backend deterministically, without storing any per-session state? If backends come and go, only +the flows that were assigned to the changed backend are affected — all other flows keep working. +Google solved this problem at scale and published their solution. They call it Maglev.

## Introducing Maglev

{{< image width="12em" float="right" src="/assets/vpp-maglev/maglev.png" alt="Icon of a maglev train" >}}

Google's Maglev load balancer has been running in production since 2008 and I happen to know several +of its authors - as a personal aside I was sad to learn that Cody Smith, with whom I shared an office +and a team for many years, passed away earlier this year. Rest in peace, Cody!

The Google team published their design at NSDI 2016 in the paper +[[Maglev: A Fast and Reliable Software Network Load Balancer](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)]. +It is worth reading in full — the paper is well written and covers not only the hashing algorithm +but also the wider architecture of how Google handles frontend traffic at scale.

The key insight is that Maglev uses a pre-computed lookup table of size M (some large prime number, +65537 in the paper) filled with backend indices. To handle a packet, the forwarder computes a +hash over the 5-tuple modulo M, looks up the table, and forwards to whatever backend is stored +there. No per-session state is needed, which avoids both session matching and lots of RAM, and +the flow lookup can be done super efficiently.

### The Maglev new flow table

The interesting part is _how_ that lookup table is filled. I learn that a simple approach might be +to divide M slots evenly among N backends. That would work, but removing a backend would shift every +remaining backend's range, disrupting all flows and resetting TCP connections all over the place. +Maglev uses a smarter fill algorithm:

1. For each backend _i_, derive two independent hash values from its identity (typically its IP +address): an offset and a skip value. These define a *preference list* — a permutation of all M +slots that this backend would like to occupy, in preference order. +1. Iterate over all backends round-robin. Each backend claims its next preferred slot if it is +still free. Continue until every slot is filled.

The result is a table where each backend occupies approximately M/N slots, the distribution is +uniform, and most importantly, adding or removing one backend only displaces approximately 1/N of +the flows. All other flows keep hashing to the same backend. Slick!

### The Maglev existing flow hash table

Consistent hashing handles the common case well, but there is one subtlety: the hashing +guarantees that the *same 5-tuple always maps to the same backend*, but only as long as the set +of backends does not change.
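To get a feel for how much actually changes when the backend set does change, here is a compact Python sketch of the new flow table fill as the paper describes it: an offset and a skip per backend define its preference list, and backends then claim free slots round-robin. The hash construction and the example backends are my own choices for illustration; this follows the paper's pseudocode, not the VPP plugin's C implementation:

```
import hashlib

def _h(name, salt):
    # Two independent hash values per backend, derived here from SHA-256.
    return int(hashlib.sha256(f"{salt}:{name}".encode()).hexdigest(), 16)

def maglev_table(backends, m=65537):
    """Fill an M-sized lookup table (M prime), as in the Maglev paper."""
    offsets = {b: _h(b, "offset") % m for b in backends}
    skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
    nxt = {b: 0 for b in backends}        # position in each preference list
    table, filled = [None] * m, 0
    while filled < m:
        for b in backends:                # round-robin over the backends
            # Walk this backend's preference list until a free slot is found.
            while True:
                slot = (offsets[b] + nxt[b] * skips[b]) % m
                nxt[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

servers = [f"10.0.0.{i}" for i in range(1, 8)]
before = maglev_table(servers)            # seven backends
after = maglev_table(servers[:-1])        # remove 10.0.0.7
moved = sum(1 for x, y in zip(before, after) if x != y and x != "10.0.0.7")
print(f"slots whose surviving backend changed: {moved / len(before):.2%}")
```

Removing `10.0.0.7` hands its roughly 1/7th of the slots to the survivors, and only a small additional fraction of slots should move between the surviving backends. That residual churn is small, but it is not zero.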
If a backend is added mid-stream, a fraction of existing TCP +connections will start hashing to a different backend.

To protect long-lived connections, Maglev keeps a small per-CPU *flow hash table*: an LRU cache of +recently seen 5-tuple to backend mappings. For every packet:

1. Look up in the Maglev flow hash table. On a hit, forward to the cached backend (even if the Maglev +table would now say something different). +1. On a miss, look up the Maglev new-flow table, select the backend, and insert the mapping into the +flow hash table.

The flow hash table does not need to be exhaustive — it only needs to cover *active* connections. An +LRU eviction policy handles the rest. This means the load balancer is *mostly* stateless, as the +Maglev table is deterministic and identical on every CPU, with just enough per-connection state +to protect existing TCP sessions from transient backend changes.

### VPP LB: Plugin anatomy

The VPP load balancer plugin lives in `src/plugins/lb/`. Its core data structures map directly to +the Maglev design:

* ***VIP*** (Virtual IP): a prefix plus an optional {protocol, port} pair. This is the +public-facing address that clients connect to. A VIP can be protocol-agnostic and forward all +traffic to its backends, or it can be port-specific and forward only, say, TCP/443 to its +backends. +* ***AS*** (Application Server): a backend endpoint associated with a VIP. The plugin +maintains a list of active ASes per VIP. +* ***New flow table***: the Maglev lookup table, computed from the active AS list whenever an +AS is added or removed. Size is configurable, defaulting to 1024 entries. It is filled by the +clever algorithm described above. +* ***Flow hash table***: per-worker LRU hash table of recent {5-tuple → AS} mappings. This is the +connection affinity cache described above. +* ***Encapsulation***: packets are forwarded to the AS by encapsulating them in either GRE +(GRE4 or GRE6), or via L3DSR (direct server return using DSCP remarking). The AS decapsulates and +responds directly to the client, bypassing the load balancer on the return path.

When a new flow arrives, VPP computes a hash over the 5-tuple modulo the length of its +`new_flow_table`, looks up the backend that will serve this client, stores it in the +per-worker flow hash table, and encapsulates the packet towards the AS. Subsequent packets for the +same 5-tuple hit the flow hash table directly, skipping the Maglev lookup entirely.

A garbage collection timer periodically walks the flow table and removes entries for backends that +have become inactive, preventing stale flows from reaching a long-gone AS. Operators can also remove +these ASes and flush existing connections to them.

#### Observations

After reading the LB code in VPP, I am ready to make a few observations.

***1: Lameduck*** My first choice is 'remove AS from VIP': by removing the AS from the Maglev new-flow +table, it will not get new flows assigned, but if there are long-lived clients, the server will keep +connections open potentially indefinitely. A good example is a websocket that streams data between +a client and the webserver: it never disconnects!

My other choice is to 'Remove and flush AS from VIP', which will also remove it from being eligible +for new flows, but forcibly remove all existing flows from the flow hash table at the same time. +Yikes.

I want a middle ground, operationally: +1. Remove AS from VIP for _new connections_ while keeping existing ones for a grace period.
This +is commonly referred to as _lameduck_ mode. +1. Remove AS from VIP for _all connections_, which will reset any lingering connections and move +them to another backend where they reconnect and continue on their journey.

***2: Slow undrain***: From my own experience, adding a new AS often needs to be done carefully, for +two reasons. First, sloshing traffic around can overwhelm a new / freshly started server which does +lazy initialization (for example, a Java binary). Second, a new server may have a different +configuration on purpose, for example a different version of the server binary, or different +parameters like caching flags and what-not. It may be good to ease in traffic and inspect it for a +little while before bringing full load onto the server. This is commonly referred to as a _canary_ +backend. I'll come back to this later.

### VPP LB: Bugs

While playing around with the plugin's binary API, I ran into a collection of bugs that made the +plugin largely unusable via the API (as opposed to the CLI). I fixed those in Gerrit +[[45428](https://gerrit.fd.io/r/c/vpp/+/45428)].

* ***IPv4 VIP prefixlen offset bug***: `lb_add_del_vip()` was computing the prefix length +incorrectly for IPv4 addresses due to an off-by-one in the address family handling, producing +VIPs that silently matched no traffic.

* ***Wrong encap type on VIP create***: Both `lb_add_del_vip()` and `lb_add_del_vip_v2()` +were passing the encapsulation type through an incorrect enum mapping, so a VIP created with GRE4 +encap via the API would actually end up configured with a different encap type internally.

* ***lb_vip_dump() returning wrong fields***: The dump handler was returning a stale encap +type and an incorrect protocol value, making it impossible to verify what was actually configured +via the API.

* ***lb_as_dump() port filter broken***: The AS dump call accepts an optional VIP filter. The +port comparison was being done against an uninitialized variable, causing the filter to miss +entries or match wrong ones depending on stack contents.

* ***Missing lb_conf_get()***: There was no API call to retrieve the global LB configuration +(flow table size, timeout values). I added `lb_conf_get()` so an operator or controlplane can verify +the running configuration without resorting to CLI parsing.

* ***'show lb vips' unformatting error***: The CLI handler dereferenced a pointer that +is only valid in verbose mode, causing unexpected output (and a possible crash!) on a plain `show lb +vips`.

* ***GC only triggered by CLI input***: The garbage collector for the flow table was only +invoked when the operator typed a CLI command. On a production load balancer, stale flow entries +would accumulate indefinitely. So I added a periodic GC timer that automatically cleans up the flow +hash table.

While discussing on the `vpp-dev` mailing list, my buddy Jerome Tollet independently found two of +these bugs (the encap type mismatch and the dump port filter) and reported them during review. Both +are addressed in the latest patchset.

### VPP LB: New Feature - Weights

My attempt to address the two observations above comes from the insight that they are actually the +same class of problem: I want to be able to send a given backend any fraction of traffic, from 100% of what it can handle all the way down to 0%, and I want to be able to +flush (remove existing flows from the flow hash table) independently of the new-flow assignment, as sketched below.
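As a way of picturing this, here is a small Python continuation of the earlier new flow table sketch. A 0-100 weight scales how often each AS gets a turn at claiming slots; the credit-based scheme below is purely my own illustration of the idea, not the implementation in the Gerrit change:

```
import hashlib
from collections import Counter

def _h(name, salt):
    # Two independent hash values per backend, derived here from SHA-256.
    return int(hashlib.sha256(f"{salt}:{name}".encode()).hexdigest(), 16)

def weighted_maglev_table(weights, m=65537):
    """Maglev-style fill where a 0-100 weight scales each AS's share of
    the new flow table. Weight 0 means: claim no slots at all."""
    names = sorted(b for b, w in weights.items() if w > 0)
    assert names, "at least one AS needs weight > 0"
    offsets = {b: _h(b, "offset") % m for b in names}
    skips = {b: _h(b, "skip") % (m - 1) + 1 for b in names}
    nxt = {b: 0 for b in names}
    credit = {b: 0 for b in names}
    table, filled = [None] * m, 0
    while filled < m:
        for b in names:                   # deterministic round-robin
            credit[b] += weights[b]
            if credit[b] < 100:           # not this AS's turn this pass
                continue
            credit[b] -= 100
            while True:                   # claim the next free preferred slot
                slot = (offsets[b] + nxt[b] * skips[b]) % m
                nxt[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

# A canary backend at weight 10 next to two full-weight backends ends up
# with roughly 1/21st of the slots; at weight 0 it would get none at all.
print(Counter(weighted_maglev_table({"10.0.0.1": 100, "10.0.0.2": 100, "10.0.0.3": 10})))
```

With a knob like this, draining a backend and canarying a new one become the same operation, just performed in opposite directions.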
Such a variable share of new flows is commonly referred to as _weight_ in a load balancer, and in Gerrit +[[45487](https://gerrit.fd.io/r/c/vpp/+/45487)] I add per-AS weights to the Maglev new flow table +and decouple 'flush' from 'set weight' semantically.

The motivation comes from the two operational scenarios I kept running into while testing the plugin:

**1. Draining a backend without disrupting existing sessions.** When a backend needs to go down for +maintenance, the only option was `lb as del flush`, which both removes the AS *and* flushes the +flow table. Flushing the flow table is disruptive: all existing TCP sessions that were pinned to +any backend suddenly need to re-select, causing a brief spike of misdirected packets. What I +actually want is to stop sending *new* flows to the AS while letting existing sessions drain +naturally.

**2. Introducing a new backend gradually.** When adding a new AS to a busy VIP, the Maglev algorithm +immediately assigns it ~1/N of the new-flow table slots. On a VIP handling tens of thousands of +new connections per second, that is a lot of traffic hitting a backend that may not yet be fully +warmed up (think JVM JIT, filled caches, established database connections). It would be useful to +introduce the new AS slowly and ramp it up over time.

My solution for both is to allow each AS to carry a weight in the range 0–100, which controls what +fraction of the new flow table slots it is allowed to occupy:

* ***weight 100*** (default): the AS gets its full ~1/N share of slots. This is the existing +behavior. +* ***weight 1–99***: the AS gets a proportionally smaller share. Useful for gradual introduction +as well as gradual removal. +* ***weight 0***: the AS gets no slots in the new flow table — no new flows are sent to it. The +flow table entries for existing sessions remain intact, so those connections keep working until +they naturally expire.

The Maglev fill algorithm is made weight-aware by scaling each AS's preference list length +proportionally to its weight. The sort order is deterministic (sorted by `(replica, address)`) +so the resulting table is identical regardless of the order ASes were added, which also has a bonus +side effect of making anycast and ECMP VIPs work correctly.

Because VPP developers do not change API signatures once they are published, I added a few new API +calls instead:

* `lb_as_add_del_v2()` — creates or deletes an AS with an explicit weight, and optionally +flushes the flow table for that AS on deletion. +* `lb_as_dump_v2()` — returns the weight and the number of new-flow-table buckets currently +assigned to each AS, which is useful for verifying the distribution. +* `lb_as_set_weight()` — changes the weight of an existing AS in place, optionally flushing +the flow table, without needing to delete and recreate the AS.

From the CLI, the weight is set with:

```
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 1
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 10
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 100
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 flush
```

In the sequence above, backend AS `10.0.0.1` starts off fully drained, then gets a token amount of +traffic at weight 1, then more at weight 10, and finally its full share at weight 100. When the backend needs to be +removed, I can set `weight 0` which will put it in _lameduck_ mode but keep existing flows alive.
A few minutes later, I can set it to `weight 0 flush`, which will remove the remaining existing +flows. The backend can then be safely removed, without having to wait 5+ hours like I did with the +uncontrolled DNS 'drain'.

### VPP LB: New Feature - Punt Unknown

I'm still on the fence about this feature, but since I wrote it anyway: Gerrit +[[45431](https://gerrit.fd.io/r/c/vpp/+/45431)] adds a `punt` flag to port-based VIPs.

By default, when a VIP is configured with a specific protocol and port (e.g. TCP/443), any packet +that arrives at that VIP's address but does *not* match the configured {protocol, port} pair is +sent by VPP to `error-drop`. This is the correct behavior for most cases: if I am load balancing +TCP/443, I do not want stray UDP packets forwarded anywhere.

The problem is that this also drops ICMP. If an operator runs `traceroute` towards the VIP, or +sends an ICMP echo, or a client receives an ICMP unreachable, all of that is silently discarded. +This makes the VIP opaque from the network's perspective and can complicate debugging.

When creating a port-based VIP, I decide to add a `punt` flag, so any traffic that does not match +the configured protocol/port pairs on the VIP will now be punted to the local IP stack +(`ip4-local` or `ip6-local`) instead of dropped. To make this work, I ask VPP to insert the VIP's +address into the FIB at a higher priority than device routes, so the punt path is actually +reachable. This allows the load balancer to handle TCP/443 (or whatever protocol/port combinations +are configured) while the local stack takes care of ICMP, traceroute, and anything else that arrives +at that address and is not part of the Maglev configuration.

The `punt` flag is only permitted on port-based VIPs — on a protocol-agnostic VIP there is +nothing left to punt, since all traffic is already matched and forwarded to application servers.

Enabling this from the CLI is straightforward, at creation time:

```
vpp# loopback create interface instance 0
vpp# lcp create loop0 host-if maglev0
vpp# set int state loop0 up
vpp# set int ip address loop0 192.0.2.0/32
vpp# lb vip 192.0.2.0/32 protocol tcp port 443 encap gre4 punt
```

In this configuration snippet, I first create a simple loopback device with a given IPv4 address, +and plumb it through to Linux using the [[Linux CP]({{< ref 2021-08-12-vpp-1 >}})] plugin. This makes +it reachable: I can ping it and traceroute to it just like any other Linux Interface Pair (_LIP_). +Then, I _steal_ some traffic from it by creating an LB VIP on this address. Without this feature, +the VIP would become unreachable, as the LB plugin would take all traffic destined to the IPv4 +address. But with the `punt` keyword, any traffic not matching the LB VIP(s) on this address will +be sent onwards to the IP stack and end up in Linux. For those of us who like pinging their VIPs, +the `punt` feature flag on VIPs will come in handy.

For the same reason as with the other feature I wrote, I need to add new API calls rather than +changing existing ones, so here I go:

* `lb_add_del_vip_v3()` — adds an `is_punt` flag to the VIP creation call. +* `lb_vip_dump_v2()` — returns `is_punt` in the VIP details, so an operator or controlplane can +verify the configuration.

## What's Next

I am going to use Maglev at IPng Networks to load balance our services like SMTP, IMAP, HTTP, DNS and +what-not.
But before I can do that, I'm going to want to write some sort of controlplane that can +manipulate the VIPs, AS weights, and do things like health checking. I'm inspired by +[[HAProxy](https://haproxy.org/)] which I used to use way back when. I find its health checking +algorithm particularly clever, so I will give that codebase a good read and with what I learn, +create a health checking VPP Maglev controlplane which will give me much better insight into what +traffic goes where. + +Stay tuned! diff --git a/static/assets/vpp-maglev/maglev.png b/static/assets/vpp-maglev/maglev.png new file mode 100644 index 0000000..ea09ceb --- /dev/null +++ b/static/assets/vpp-maglev/maglev.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6d8e95e4b63d7f70ddb30cb55a6ae11cdd0474a81cb98662cfc8bd7e73bcb7bf +size 20190 diff --git a/static/assets/vpp-maglev/qps-before.png b/static/assets/vpp-maglev/qps-before.png new file mode 100644 index 0000000..fe80546 --- /dev/null +++ b/static/assets/vpp-maglev/qps-before.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2ccc1c662dc1740e390c0dc43872f6cdd5b4ce910964c1f5db2f841545e46954 +size 179055