---
date: "2025-07-12T10:07:23Z"
title: 'VPP and eVPN/VxLAN - Part 1'
---

{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# Introduction

You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm the very last person on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.

I can build the equivalent of Virtual Private Wire Services (VPWS), also called L2VPN or Virtual Leased Lines, and these are straightforward because they typically only have two endpoints. A "regular" VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other level of awesome.

## Recap: VPP today

### VPP: VxLAN

The current VPP VxLAN tunnel plugin does point to point tunnels, that is they are configured with a source address, destination address, destination port and VNI.
As I mentioned, a point to point ethernet transport is configured very easily:

```
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
vpp0# set int state vxlan_tunnel0 up
vpp0# set int state HundredGigabitEthernet10/0/0 up

vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
vpp1# set int state vxlan_tunnel0 up
vpp1# set int state HundredGigabitEthernet10/0/1 up
```

And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between the two.

### VPP: Bridge Domains

Now consider a VPLS with five different routers. It's possible to create a bridge-domain and add some local ports and four other VxLAN tunnels:

```
vpp0# create bridge-domain 8298
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
vpp0# set int l2 bridge vxlan_tunnel0 8298
vpp0# set int l2 bridge vxlan_tunnel1 8298
vpp0# set int l2 bridge vxlan_tunnel2 8298
vpp0# set int l2 bridge vxlan_tunnel3 8298
```

To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers. While it does work, it's really not very practical. When other VPP instances get added to a VPLS, every other router will have to have a new VxLAN tunnel created and added to its local bridge domain. Consider 1000s of VPLS instances on 100s of routers: it would yield ~100'000 VxLAN tunnels on every router, yikes!
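
To put a number on that, here is a back-of-the-envelope sketch. It assumes the worst case in which every VPLS instance has a member port on every router, so each router needs one point to point tunnel per remote router per instance. The function name `tunnels_per_router` is mine and purely illustrative, it is not anything VPP provides.

```python
def tunnels_per_router(num_routers: int, num_vpls: int) -> int:
    """Point to point VxLAN tunnels a single router needs in a full mesh.

    Assumption: every VPLS instance is present on every router.
    """
    return num_vpls * (num_routers - 1)


if __name__ == "__main__":
    # The five-router VPLS from the example above: four tunnels on vpp0.
    print(tunnels_per_router(num_routers=5, num_vpls=1))
    # 1000 VPLS instances on 100 routers: roughly 100'000 tunnels per router.
    print(tunnels_per_router(num_routers=100, num_vpls=1000))
```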

Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router connects, and their job is to redistribute routing information between the fleet of routers. This turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000 BGP connections in the naive approach.

## Recap: eVPN Moving parts

The reason why I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff is that it requires very little dataplane configuration, and a relatively intuitive controlplane configuration:

1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I need is a single VxLAN interface with a given VNI, which should be able to send encapsulated ethernet frames to one or more other speakers in the same domain.
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and will send me encapsulated ethernet for those addresses.
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and _Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these packets should be flooded to. I make note that this is not that different from flooding the packets to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
1. **Controlplane**: Flooding L2 traffic across wide area networks is typically considered icky, so a few tricks might be optionally deployed.
   Since the controlplane already knows which MAC lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.

{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.network.cz/)], is slowly catching up, and has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista, Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few changes.

## VPP: Changes needed

### Dynamic VxLAN

I propose two changes to the VxLAN plugin, or perhaps, a new plugin that changes the behavior so that we don't have to break any performance or functional promises to existing users. This new VxLAN interface behavior changes in the following ways:

1. Each VxLAN interface has a local L2FIB attached to it, the keys are MAC addresses and the values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses, because I can re-use the VNI and port information from the tunnel definition itself.
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked for flooding, I will need to prepare and replicate them, sending them to each VTEP.

A set of APIs will be needed to manipulate these:

* ***Interface***: I will need to have an interface create, delete and list call, which will be able to maintain the interfaces, their metadata like source address, source/destination port, VNI and such.
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where. With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the dst_addr can be written into the packet.
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add, remove and list which VTEPs should receive this packet.

It would be pretty dope if the configuration looked something like this:

```
vpp# create evpn-vxlan src <address> dst-port <port> vni <vni> instance <id>
vpp# evpn-vxlan l2fib <interface> mac <mac> dst <address> [del]
vpp# evpn-vxlan flood <interface> dst <address> [del]
```

The VxLAN underlay transport can be either IPv4 or IPv6. Of course, L2FIB and Flood destinations must match the address family of the evpn-vxlan interface they are added to. A practical example might be:

```
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 0
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
```

By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic' tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.

### Bridge Domain

{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

It's important to understand that L2 learning is **required** for eVPN to function. Each router needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This rules out the simple case of L2XC because there, no learning is performed.
The corollary is that a bridge-domain is required for any form of eVPN.

The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`, which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information though, notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this extra data to the bridge domain code. I would recommend against it, because other implementations, for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even the VxLAN implementation I'm thinking about might want to be able to override other things like the destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain code will just clutter it, for all users, not just those users who might want eVPN.

Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`, because if it's already replicating BUM traffic, why not replicate it many times over the flood list for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the `l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE above).

A reasonable request is to mark such BUM frames once in the existing L2 code and, when handing the replicated packet into the VxLAN node, to see the `is_bum` marker and once again replicate -- in the vxlan plugin -- these packets to the VTEPs in our local flood-list.
Although a bit more work, this approach only requires a tiny amount of work in the `l2_flood.c` code (the marking), and will keep all the logic tucked away where it is relevant, derisking the VPP vnet codebase.

Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully self-contained, and it would therefore maintain its own L2FIB and Flooding logic. The only thing I would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.

### Control Plane

There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be receiving a few types of eVPN messages. The ones I'm interested in are:

* ***Type 2***: MAC/IP Advertisement Route
  - On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
  - On the way out, learned addresses should be advertised to peers.
  - Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
  - On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain.
  - On the way out, each bridge-domain should advertise itself as IMET to peers.
* ***Type 5***: IP Prefix Route
  - Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is needed.

The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used to consume this information.

The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast domains and associate them with the right IMET announcements, and for each MAC address learned, pick them up and advertise them into eVPN.
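
To make the 'on the way in' half concrete, here is a minimal sketch in Python of the state a dynamic VxLAN interface would hold and how Type 2 and Type 3 routes could feed it. This is a model for discussion, not VPP code: the class `DynamicVxlan` and its methods `on_type2()`, `on_type3()` and `destinations()` are hypothetical names of mine. It also simplifies the dataplane by treating an L2FIB miss the same as a frame carrying the BUM marker, so both end up on the flood list.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address


@dataclass
class DynamicVxlan:
    """Hypothetical model of one dynamic VxLAN interface (not VPP code)."""

    vni: int
    l2fib: dict = field(default_factory=dict)  # MAC address -> remote VTEP
    flood: list = field(default_factory=list)  # remote VTEPs for BUM traffic

    def on_type2(self, mac: str, vtep: str, withdraw: bool = False) -> None:
        """MAC/IP Advertisement route: add, replace or delete an L2FIB entry."""
        if withdraw:
            self.l2fib.pop(mac.lower(), None)
        else:
            self.l2fib[mac.lower()] = ip_address(vtep)

    def on_type3(self, vtep: str, withdraw: bool = False) -> None:
        """Inclusive Multicast Ethernet Tag route: maintain the flood list."""
        addr = ip_address(vtep)
        if withdraw:
            if addr in self.flood:
                self.flood.remove(addr)
        elif addr not in self.flood:
            self.flood.append(addr)

    def destinations(self, dst_mac: str, is_bum: bool = False) -> list:
        """Return the VTEPs that should receive a frame for dst_mac."""
        hit = None if is_bum else self.l2fib.get(dst_mac.lower())
        # Simplification: an unknown destination is flooded like BUM traffic.
        return [hit] if hit is not None else list(self.flood)


if __name__ == "__main__":
    vx = DynamicVxlan(vni=8298)
    for vtep in ("2001:db8::2", "2001:db8::3", "2001:db8::4"):
        vx.on_type3(vtep)
    vx.on_type2("00:01:02:82:98:02", "2001:db8::2")
    print(vx.destinations("00:01:02:82:98:02"))               # one known VTEP
    print(vx.destinations("ff:ff:ff:ff:ff:ff", is_bum=True))  # whole flood list
```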

Later, if ever ARP and ND proxying becomes important, I'll have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies can be synthesized based on what we've learned in eVPN.

# Demonstration

### VPP: Current VxLAN

I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge Domain works today:

```
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
vpp# set int state tap0 up
vpp# set int ip address tap0 192.0.2.1/24
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static

vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
vpp# set int state vxlan_tunnel0 up

vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
vpp# set int state tap1 up

vpp# create bridge-domain 8298
vpp# set int l2 bridge tap1 8298
vpp# set int l2 bridge vxlan_tunnel0 8298
```

I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries. Again, normally this would just be 'use normal routing'. However, for the purposes of this demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and the other neighbors can be captured with `tcpdump` in Linux in addition to `trace add` in VPP.

Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI. Next, I create a TAP interface called `vpptap0` with the given MAC address. Finally, I bind these two interfaces together in a bridge-domain.
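
As an aside for reading packet captures of this setup: the VxLAN header that sits between the outer UDP header and the inner ethernet frame is only eight bytes. The sketch below builds and parses it following the layout in RFC 7348 (8 bits of flags with the `I` bit `0x08`, 24 reserved bits, a 24 bit VNI and 8 more reserved bits). The helper names `vxlan_header()` and `parse_vni()` are mine and purely illustrative, they are not part of VPP or Scapy.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VxLAN header for a given VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VxLAN header."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return vni_word >> 8


if __name__ == "__main__":
    hdr = vxlan_header(8298)
    print(hdr.hex())       # flags 0x08 first, VNI 0x00206a in bytes 4-6
    print(parse_vni(hdr))  # the VNI that went in
```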

I proceed to write a small ScaPY program:

```python
#!/usr/bin/env python3

from scapy.all import Ether, IP, UDP, Raw, sendp

# First frame: destination MAC ending in :02, injected into the bridge via vpptap0
pkt = Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800) / IP(src="192.168.1.1", dst="192.168.1.2") / UDP(sport=8298, dport=7) / Raw(load=b"ping")
print(pkt)
sendp(pkt, iface="vpptap0")

# Second frame: same source, destination MAC ending in :03
pkt = Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800) / IP(src="192.168.1.1", dst="192.168.1.3") / UDP(sport=8298, dport=7) / Raw(load=b"ping")
print(pkt)
sendp(pkt, iface="vpptap0")
```

What will happen is, the ScaPY program will emit these frames into device `vpptap0` which is in bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC `01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the tunnel.

```
pim@summer:~$ sudo ./vxlan-test.py
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw

pim@summer:~$ sudo tcpdump -evni dummy0
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
    192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
    192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
    192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
    192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```

I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine. I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.

### Dynamic VPP VxLAN

I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)]. The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this article) with the community before going further down a potential rabbit hole.

With my gerrit patched in, I can do the following:

```
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3

vpp# show vxlan l2fib
VXLAN Dynamic L2FIB entries:
MAC                Interface      Destination  Port  VNI
01:02:03:04:05:02  vxlan_tunnel0  192.0.2.2    4789  8298
01:02:03:04:05:03  vxlan_tunnel0  192.0.2.3    4789  8298
Dynamic L2FIB entries: 2
```

I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC. I run the script and tcpdump again:

```
pim@summer:~$ sudo tcpdump -evni dummy0
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
    192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
    192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
    192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
    192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```

Two important notes: Firstly, this works! For the MAC address ending in `:02`, the packet is sent to `192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to `192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.

# What's next

I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:

1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help write it with me?
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds dynamic endpoints, L2FIB and Flood lists for BUM traffic?
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic from bridge-domain flooding as I extend to also do VTEP flooding?
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to, say, GENEVE or MPLS?
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?