---
date: "2026-04-30T06:35:14Z"
title: VPP with Maglev Loadbalancing - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
Load balancing is one of those topics that sounds deceptively simple until you think about it for a
while. In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs,
and add two small new features that make running it in production a little bit easier.
## Introduction
IPng runs services that want to be reachable via as few public IP addresses as possible. Let's say I
want to run a DNS resolver or authoritative nameserver or even the IPng website, but I want these to
be highly available and perhaps scale to more traffic than one backend server could provide. What
are my options?
My first option is to just put a bunch of servers online, say 7 webservers, give each of them an
A/AAAA record in DNS, and point `ipng.ch` at all of them. It's clumsy: notably, if one server is
down for maintenance or has failed, one seventh of the traffic will still try to reach it. Also,
removing a server from DNS leaves lots of lingering traffic on that webserver, as clients are
sometimes slow to pick up DNS changes, even if my TTL is low.
Let me show you an example:
{{< image width="100%" src="/assets/vpp-maglev/qps-before.png" alt="NGINX qps per instance" >}}
There are two main problems with this graph:
1. ***Load imbalance***: there are seven webservers in this graph, but somehow only three of them
are getting traffic, the others are not. One (`nginx0.chrma0`) is much more heavily loaded than the
others: it's receiving 1.2kqps while the others receive ~40qps. This poses a risk: if the clients
that are somehow attracted to this instance grow in number, they may overwhelm this little
webserver, even though there are six others that could help out!
2. ***Drains take _forever_***: The green graph was a drain of `nginx0.nlams2` due to a pending
maintenance window as the datacenter is closing and the server needs to be physically moved. I put
in the DNS change at around 16:15 UTC and the traffic finally dropped at 21:45, a full five hours
(!) later. And believe it or not, the TTL was 15 minutes on these records. Some clients just don't
get the hint ...
### Load balancing 101
A naive load balancing solution is to simply round-robin: send each new packet to the next backend in the list.
That works reasonably well for stateless UDP traffic like DNS, although even with DNS there is a gotcha: some
DNS queries need TCP, for example those that are too big to fit in a single UDP packet, and they
will not be tolerant of naive packet round robin. For TCP this naive load balancing solution
quickly falls apart, because every packet in a connection needs to reach the *same* backend. Sending
a SYN to backend A and the subsequent ACK to backend B will not establish a TCP connection.
The classical answer is to keep per-session state on the load balancer: a table that maps a 5-tuple of
{source IP, destination IP, source port, destination port, protocol} to a chosen backend. That
works, but it introduces a stateful bottleneck. At line rate on a load balancer handling millions of
flows and packets/sec, maintaining and synchronising that table across multiple CPU threads is expensive. It also means that if the
load balancer restarts, every existing TCP session breaks.
What if there was some form of *consistent hashing*: given the 5-tuple of a packet, the load balancer
might always select the same backend deterministically, without storing any per-session state. If backends come and go, only
the flows that were assigned to the changed backend are affected — all other flows keep working.
Google solved this problem at scale and published their solution. They call it Maglev.
## Introducing Maglev
{{< image width="12em" float="right" src="/assets/vpp-maglev/maglev.png" alt="Icon of a maglev train" >}}
Google's Maglev load balancer has been running in production since 2008 and I happen to know several
of its authors - as a personal aside I was sad to learn that Cody Smith, with whom I shared an office
and a team for many years, passed away earlier this year. Rest in peace, Cody!
The Google team published their design at NSDI 2016 in the paper
[[Maglev: A Fast and Reliable Software Network Load Balancer](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].
It is worth reading in full — the paper is well written and covers not only the hashing algorithm
but also the wider architecture of how Google handles frontend traffic at scale.
The key insight is that Maglev uses a pre-computed lookup table of size M (some large prime number,
65537 in the paper) filled with backend indices. To handle a packet, the forwarder computes a
hash over the 5-tuple, takes it modulo M, looks up the table, and forwards to whatever backend is
stored there. No per-session state is needed, which avoids session matching and lots of RAM, and
the flow lookup can be done very efficiently.
### The Maglev new flow table
The interesting part is _how_ that lookup table is filled. A naive approach might be to divide the
M slots evenly among the N backends. That would work, but removing a backend would shift every
remaining backend's range, disrupting all flows and resetting TCP connections all over the place.
Maglev uses a smarter fill algorithm:
1. For each backend _i_, derive two independent hash values from its identity (typically its IP
address): an offset and a skip value. These define a *preference list* — a permutation of all M
slots that this backend would like to occupy, in preference order.
1. Iterate over all backends round-robin. Each backend claims its next preferred slot if it is
still free. Continue until every slot is filled.
The result is a table where each backend occupies approximately M/N slots, the distribution is
uniform, and most importantly, adding or removing one backend only displaces approximately 1/N of
the flows. All other flows keep hashing to the same backend. Slick!
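The two steps above can be sketched in Python, following the pseudocode in the NSDI'16 paper. The
two hash functions are stand-ins (VPP's implementation differs in detail), and `m` must be prime so
that each backend's preference list visits every slot exactly once:

```python
import hashlib

def _h(seed: str, data: str) -> int:
    # Stand-in for the paper's two independent hash functions.
    return int(hashlib.sha256((seed + data).encode()).hexdigest(), 16)

def maglev_table(backends: list[str], m: int = 65537) -> list[int]:
    """Fill an m-sized lookup table with backend indices, Maglev-style."""
    n = len(backends)
    # Each backend's preference list is the progression (offset + j*skip) mod m.
    # With m prime and skip in [1, m-1], it permutes all m slots.
    offset = [_h("offset", b) % m for b in backends]
    skip = [_h("skip", b) % (m - 1) + 1 for b in backends]
    next_idx = [0] * n     # position in each backend's preference list
    table = [-1] * m
    filled = 0
    while True:
        for i in range(n):
            # Backend i claims its next preferred slot that is still free.
            while True:
                slot = (offset[i] + next_idx[i] * skip[i]) % m
                next_idx[i] += 1
                if table[slot] == -1:
                    break
            table[slot] = i
            filled += 1
            if filled == m:
                return table
```

Because the backends claim slots one per round, each ends up with either ⌊M/N⌋ or ⌈M/N⌉ slots, so
the distribution is as uniform as it can be.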
### The Maglev existing flow hash table
Consistent hashing handles the common case well, but there is one subtlety: the hashing
guarantees that the *same 5-tuple always maps to the same backend*, but only as long as the set
of backends does not change. If a backend is added mid-stream, a fraction of existing TCP
connections will start hashing to a different backend.
To protect long-lived connections, Maglev keeps a small per-CPU *flow hash table*: an LRU cache of
recently seen 5-tuple to backend mappings. For every packet:
1. Look up in the Maglev flow hash table. On a hit, forward to the cached backend (even if the Maglev
table would now say something different).
1. On a miss, look up the Maglev new-flow table, select the backend, and insert the mapping into the
flow hash table.
The flow hash table does not need to be exhaustive — it only needs to cover *active* connections. An
LRU eviction policy handles the rest. This means the load balancer is *mostly* stateless, as the
Maglev table is deterministic and identical on every CPU, with just enough per-connection state
to protect existing TCP sessions from transient backend changes.
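The two-step lookup described above can be sketched as follows. This is a deliberate
simplification of my own: VPP's flow hash table is a purpose-built per-worker structure, not a
Python dict, but the hit/miss/evict logic is the same:

```python
from collections import OrderedDict

class FlowCache:
    """Connection-affinity cache in front of a stateless Maglev lookup."""
    def __init__(self, maglev_lookup, capacity=1024):
        self.lookup = maglev_lookup   # 5-tuple -> backend (stateless)
        self.capacity = capacity
        self.cache = OrderedDict()    # 5-tuple -> backend (LRU, stateful)

    def backend_for(self, five_tuple):
        if five_tuple in self.cache:
            # Hit: keep the cached backend, even if the Maglev table
            # has since changed under us.
            self.cache.move_to_end(five_tuple)
            return self.cache[five_tuple]
        # Miss: consult the Maglev table and remember the answer.
        backend = self.lookup(five_tuple)
        self.cache[five_tuple] = backend
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return backend
```

An existing flow keeps hitting its cached backend, while new flows pick up whatever the current
Maglev table says; once a flow falls out of the LRU, it reverts to the stateless path.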
### VPP LB: Plugin anatomy
The VPP load balancer plugin lives in `src/plugins/lb/`. Its core data structures map directly to
the Maglev design:
* ***VIP*** (Virtual IP): a prefix plus an optional {protocol, port} pair. This is the
public-facing address that clients connect to. A VIP can be protocol-agnostic and forward all
traffic to its backends, or it can be port-specific and forward, for example, only TCP/443 to its
backends.
* ***AS*** (Application Server): a backend endpoint associated with a VIP. The plugin
maintains a list of active ASes per VIP.
* ***New flow table***: the Maglev lookup table, computed from the active AS list whenever an
AS is added or removed. Size is configurable, defaulting to 1024 entries. It is filled by the
clever algorithm described above.
* ***Flow hash table***: per-worker LRU hash table of recent {5-tuple → AS} mappings. This is the
connection affinity cache described above.
* ***Encapsulation***: packets are forwarded to the AS by encapsulating them in either GRE
(GRE4 or GRE6), or via L3DSR (direct server return using DSCP remarking). The AS decapsulates and
responds directly to the client, bypassing the load balancer on the return path.
When a new flow arrives, VPP computes a hash over the 5-tuple modulo the length of its
new_flow_table and looks up the backend that will serve this client; it then stores that mapping in
the per-worker flow hash table and encapsulates the packet towards the AS. Subsequent packets for
the same 5-tuple hit the flow hash table directly, skipping the Maglev lookup entirely.
A garbage collection timer periodically walks the flow table and removes entries for backends that
have become inactive, preventing stale flows from reaching a long-gone AS. Operators can also
remove an AS and flush existing connections to it.
#### Observations
After reading the LB code in VPP, I am ready to make a few observations.
***1: Lameduck*** My first choice is 'remove AS from VIP': by removing the AS from the Maglev
new-flow table, it will not get new flows assigned, but if there are long-lived clients, the server
may keep connections open indefinitely. A good example is a websocket that streams data between
a client and the webserver: it never disconnects!
My other choice is to 'Remove and flush AS from VIP', which will also remove it from being eligible
for new flows, but forcibly remove all existing flows from the flow hash table at the same time.
Yikes.
I want a middle ground, operationally:
1. Remove AS from VIP for _new connections_ while keeping existing ones for a grace period. This
is commonly referred to as _lameduck_ mode.
1. Remove AS from VIP for _all connections_, which will reset any lingering connections and move
them to another backend where they reconnect and continue on their journey.
***2: Slow undrain***: From my own experience, adding a new AS often needs to be done carefully,
for two reasons. First, sloshing traffic around can overwhelm a new or freshly started server that
does lazy initialization (for example, a Java binary). Second, a new server may have a different
configuration on purpose, for example a different version of the server binary, or different
parameters like caching flags and what-not. It may be good to ease in traffic and inspect it for a
little while before bringing full load onto the server. This is commonly referred to as a _canary_
backend. I'll come back to this later.
### VPP LB: Bugs
While playing around with the plugin's binary API, I ran into a collection of bugs that made the
plugin largely unusable via the API (as opposed to the CLI). I fixed those in Gerrit
[[45428](https://gerrit.fd.io/r/c/vpp/+/45428)].
* ***IPv4 VIP prefixlen offset bug***: `lb_add_del_vip()` was computing the prefix length
incorrectly for IPv4 addresses due to an off-by-one in the address family handling, producing
VIPs that silently matched no traffic.
* ***Wrong encap type on VIP create***: Both `lb_add_del_vip()` and `lb_add_del_vip_v2()`
were passing the encapsulation type through an incorrect enum mapping, so a VIP created with GRE4
encap via the API would actually end up configured with a different encap type internally.
* ***lb_vip_dump() returning wrong fields***: The dump handler was returning a stale encap
type and an incorrect protocol value, making it impossible to verify what was actually configured
via the API.
* ***lb_as_dump() port filter broken***: The AS dump call accepts an optional VIP filter. The
port comparison was being done against an uninitialized variable, causing the filter to miss
entries or match wrong ones depending on stack contents.
* ***Missing lb_conf_get()***: There was no API call to retrieve the global LB configuration
(flow table size, timeout values). I added `lb_conf_get()` so an operator or controlplane can verify
the running configuration without resorting to CLI parsing.
* ***'show lb vips' unformatting error***: The CLI handler dereferenced a pointer that
is only valid in verbose mode, causing unexpected output (and a possible crash!) on a plain `show lb
vips`.
* ***GC only triggered by CLI input***: The garbage collector for the flow table was only
invoked when the operator typed a CLI command. On a production load balancer, stale flow entries
would accumulate indefinitely. So I added a periodic GC timer that automatically cleans up the flow
hash table.
While discussing on the `vpp-dev` mailing list, my buddy Jerome Tollet independently found two of
these bugs (the encap type mismatch and the dump port filter) and reported them during review. Both
are addressed in the latest patchset.
### VPP LB: New Feature - Weights
My attempt to address the two observations above comes from the insight that they are actually the
same class of problem: I want to be able to send a given backend anywhere from 100% all the way
down to 0% of the load it would normally receive, and I want to be able to flush (remove existing
flows from the flow hash table) independently of the new-flow assignment.
This is commonly referred to as _weights_ in a load balancer, and in Gerrit
[[45487](https://gerrit.fd.io/r/c/vpp/+/45487)] I add per-AS weights to the Maglev new flow table,
and decouple 'flush' from 'set weight' semantically.
The motivation comes from the two operational scenarios I kept running into while testing the plugin:
**1. Draining a backend without disrupting existing sessions.** When a backend needs to go down for
maintenance, the only option was `lb as del flush`, which both removes the AS *and* flushes the
flow table. Flushing the flow table is disruptive: all existing TCP sessions that were pinned to
any backend suddenly need to re-select, causing a brief spike of misdirected packets. What I
actually want is to stop sending *new* flows to the AS while letting existing sessions drain
naturally.
**2. Introducing a new backend gradually.** When adding a new AS to a busy VIP, the Maglev algorithm
immediately assigns it ~1/N of the new-flow table slots. On a VIP handling tens of thousands of
new connections per second, that is a lot of traffic hitting a backend that may not yet be fully
warmed up (think JVM JIT, filled caches, established database connections). It would be useful to
introduce the new AS slowly and ramp it up over time.
My solution for both is to allow each AS to carry a weight in the range 0-100, which controls what
fraction of the new flow table slots it is allowed to occupy:
* ***weight 100*** (default): the AS gets its full ~1/N share of slots. This is the existing
behavior, and remains the default.
* ***weight 1-99***: the AS gets a proportionally smaller share. Useful for gradual introduction
as well as gradual removal.
* ***weight 0***: the AS gets no slots in the new flow table — no new flows are sent to it. The
flow table entries for existing sessions remain intact, so those connections keep working until
they naturally expire.
The Maglev fill algorithm is made weight-aware by scaling each AS's preference list length
proportionally to its weight. The sort order is deterministic (sorted by `(replica, address)`)
so the resulting table is identical regardless of the order ASes were added, which also has a bonus
side effect of making anycast and ECMP VIPs work correctly.
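To illustrate how weights can shape the table, here is a sketch of my own, not the code in the
Gerrit change: it turns the weights into a per-backend slot budget (largest-remainder
apportionment) and has the Maglev-style claim loop respect that budget. The hash functions are
again arbitrary stand-ins:

```python
import hashlib

def _h(seed: str, data: str) -> int:
    return int(hashlib.sha256((seed + data).encode()).hexdigest(), 16)

def weighted_maglev_table(backends, weights, m=65537):
    """Maglev-style fill where each backend's slot share scales with its
    weight; weight 0 removes it from new-flow assignment entirely."""
    active = [i for i, w in enumerate(weights) if w > 0]
    if not active:
        raise ValueError("at least one backend must have weight > 0")
    total = sum(weights[i] for i in active)
    # Largest-remainder apportionment of the m slots by weight.
    exact = {i: m * weights[i] / total for i in active}
    budget = {i: int(exact[i]) for i in active}
    leftover = m - sum(budget.values())
    for i in sorted(active, key=lambda i: exact[i] - budget[i],
                    reverse=True)[:leftover]:
        budget[i] += 1
    offset = {i: _h("offset", backends[i]) % m for i in active}
    skip = {i: _h("skip", backends[i]) % (m - 1) + 1 for i in active}
    nxt = {i: 0 for i in active}
    table = [-1] * m
    filled = 0
    while filled < m:
        for i in active:
            if budget[i] == 0:
                continue  # this backend has used up its share
            while True:
                slot = (offset[i] + nxt[i] * skip[i]) % m
                nxt[i] += 1
                if table[slot] == -1:
                    break
            table[slot] = i
            budget[i] -= 1
            filled += 1
            if filled == m:
                break
    return table
```

A backend at weight 50 ends up with roughly half the slots of a weight-100 peer, and setting a
weight to 0 simply leaves it out of the table while existing flows (pinned in the flow hash table)
continue undisturbed.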
Because VPP developers do not change API signatures once they are published, I added a few new API
calls instead:
* `lb_as_add_del_v2()` — creates or deletes an AS with an explicit weight, and optionally
flushes the flow table for that AS on deletion.
* `lb_as_dump_v2()` — returns the weight and the number of new-flow-table buckets currently
assigned to each AS, which is useful for verifying the distribution.
* `lb_as_set_weight()` — changes the weight of an existing AS in place, optionally flushing
the flow table, without needing to delete and recreate the AS.
From the CLI, the weight is set with:
```
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 1
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 10
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 100
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0
vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 flush
```
In the sequence above, backend AS `10.0.0.1` starts off fully drained, then gets a token amount
of traffic at weight 1, then more at weight 10, and finally its full share at weight 100. When the
backend needs to be removed, I can set `weight 0`, which puts it in _lameduck_ mode but keeps
existing flows alive. A few minutes later, I can set `weight 0 flush`, which removes the remaining
existing flows. The backend can then be safely removed, without having to wait 5+ hours like I did
with the uncontrolled DNS 'drain'.
### VPP LB: New Feature - Punt Unknown
I'm still on the fence on this feature, but since I wrote it .. Gerrit
[[45431](https://gerrit.fd.io/r/c/vpp/+/45431)] adds a `punt` flag to port-based VIPs.
By default, when a VIP is configured with a specific protocol and port (e.g. TCP/443), any packet
that arrives at that VIP's address but does *not* match the configured {protocol, port} pair is
sent by VPP to `error-drop`. This is the correct behavior for most cases: if I am load balancing
TCP/443, I do not want stray UDP packets forwarded anywhere.
The problem is that this also drops ICMP. If an operator runs `traceroute` towards the VIP, or
sends an ICMP echo, or a client receives an ICMP unreachable, all of that is silently discarded.
This makes the VIP opaque from the network's perspective and can complicate debugging.
When creating a port-based VIP, I decide to add a `punt` flag: any traffic that does not match
the configured protocol/port pairs on the VIP will be punted to the local IP stack
(`ip4-local` or `ip6-local`) instead of dropped. To make this work, I ask VPP to insert the VIP's
address into the FIB at a higher priority than device routes, so the punt path is actually
reachable. This allows the load balancer to handle TCP/443 (or whatever protocol/port combinations
are configured) while the local stack takes care of ICMP, traceroute, and anything else that
arrives at that address and is not part of the Maglev configuration.
The `punt` flag is only permitted on port-based VIPs — on a protocol-agnostic VIP there is
nothing left to punt, since all traffic is already matched and forwarded to application servers.
Enabling this from the CLI is straightforward, at creation time:
```
vpp# loopback create interface instance 0
vpp# lcp create loop0 host-if maglev0
vpp# set int state loop0 up
vpp# set int ip address loop0 192.0.2.0/32
vpp# lb vip 192.0.2.0/32 protocol tcp port 443 encap gre4 punt
```
In this configuration snippet, I first create a simple loopback device with a given IPv4 address,
and plumb it through to Linux using the [[Linux CP]({{< ref 2021-08-12-vpp-1 >}})] plugin. This makes
it reachable: I can ping it and traceroute to it just like any other _LIP_ (Linux Interface Pair).
Then, I _steal_ some traffic from it, by creating an LB VIP on this address. Without this feature,
the VIP would become unreachable, as the LB plugin would take all traffic destined to the IPv4
address. But with the `punt` keyword, any traffic not matching the LB VIP(s) on this address, will
be sent onwards to the IP stack and end up in Linux. For those of us who like pinging their VIPs,
the `punt` feature flag on VIPs will come in handy.
For the same reason as with the other feature I wrote, I need to add new API calls rather than
changing existing ones, so here I go:
* `lb_add_del_vip_v3()` — adds a `is_punt` flag to the VIP creation call.
* `lb_vip_dump_v2()` — returns `is_punt` in the VIP details, so an operator or controlplane can
verify the configuration.
## What's Next
I am going to use Maglev at IPng Networks to load balance our services like SMTP, IMAP, HTTP, DNS and
what-not. But before I can do that, I'm going to want to write some sort of controlplane that can
manipulate the VIPs, AS weights, and do things like health checking. I'm inspired by
[[HAProxy](https://haproxy.org/)] which I used to use way back when. I find its health checking
algorithm particularly clever, so I will give that codebase a good read and with what I learn,
create a health checking VPP Maglev controlplane which will give me much better insight into what
traffic goes where.
Stay tuned!