Add a proposal article for eVPN/VxLAN in VPP

2025-07-12 11:27:33 +02:00
parent ebbb0f8e24
commit 85b41ba4e0
1 changed files with 375 additions and 0 deletions
--- a/content/articles/2025-07-12-vpp-evpn-1.md
+++ b/content/articles/2025-07-12-vpp-evpn-1.md
@@ -0,0 +1,375 @@
 ---
 date: "2025-07-12T10:07:23Z"
 title: 'VPP and eVPN/VxLAN - Part 1'
 ---
 {{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
 # Introduction
 You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
 the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
 configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
 it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
 I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
 Lines, and these are straightforward because they typically only have two endpoints. A "regular"
 VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
 look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
 is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
 called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
 level of awesome.
 ## Recap: VPP today
 ### VPP: VxLAN
 The current VPP VxLAN tunnel plugin does point-to-point tunnels: each one is configured with a
 source address, destination address, destination port and VNI. As I mentioned, a point-to-point
 ethernet transport is configured very easily:
 ```
 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
 vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
 vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
 vpp0# set int state vxlan_tunnel0 up
 vpp0# set int state HundredGigabitEthernet10/0/0 up
 vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
 vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
 vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
 vpp1# set int state vxlan_tunnel0 up
 vpp1# set int state HundredGigabitEthernet10/0/1 up
 ```
 And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
 the two.
 ### VPP: Bridge Domains
 Now consider a VPLS with five different routers. It's possible to create a bridge-domain and add
 some local ports and four other VxLAN tunnels:
 ```
 vpp0# create bridge-domain 8298
 vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
 vpp0# set int l2 bridge vxlan_tunnel0 8298
 vpp0# set int l2 bridge vxlan_tunnel1 8298
 vpp0# set int l2 bridge vxlan_tunnel2 8298
 vpp0# set int l2 bridge vxlan_tunnel3 8298
 ```
 I have to replicate this configuration to all other `vpp1`-`vpp4` routers. While it does work, it's
 really not very practical. When other VPP instances get added to a VPLS, every other router will
 have to have a new VxLAN tunnel created and added to its local bridge domain. Consider 1000s of VPLS
 instances on 100s of routers: that would yield ~100'000 VxLAN tunnels on every router, yikes!
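To get a feel for how quickly this grows, here is a minimal Python sketch that generates the static configuration for one VPLS on one router, using only the CLI commands shown above. The helper names `full_mesh_config` and `tunnels_per_router` are made up for this illustration and are not part of VPP.

```python
#!/usr/bin/env python3
"""Sketch: static full-mesh VxLAN config for one VPLS on one router."""


def full_mesh_config(local, peers, vni, port):
    """Return VPP CLI lines bridging `port` with one VxLAN tunnel per peer.

    `local` and `peers` are VTEP addresses as strings. This helper is
    hypothetical; the CLI lines it emits mirror the example above.
    """
    lines = [
        f"create bridge-domain {vni}",
        f"set int l2 bridge {port} {vni}",
    ]
    for instance, peer in enumerate(peers):
        lines.append(
            f"create vxlan tunnel src {local} dst {peer} vni {vni} instance {instance}"
        )
    for instance in range(len(peers)):
        lines.append(f"set int l2 bridge vxlan_tunnel{instance} {vni}")
    return lines


def tunnels_per_router(vpls_instances, routers_per_vpls):
    """Each VPLS needs one tunnel to every other participating router."""
    return vpls_instances * (routers_per_vpls - 1)


if __name__ == "__main__":
    peers = [f"192.0.2.{n}" for n in range(2, 6)]
    for line in full_mesh_config("192.0.2.1", peers, 8298, "HundredGigabitEthernet10/0/1"):
        print(line)
    print(tunnels_per_router(1000, 100), "tunnels on every router")
```

Every router that joins the VPLS adds one `create vxlan tunnel` and one `set int l2 bridge` line on every existing member, which is exactly the maintenance burden described above.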
 Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
 full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
 headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
 connects, and their job is to redistribute routing information between the fleet of routers. This
 turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
 regional route reflectors for a total of 3'000 BGP connections, which is much better than the
 ~500'000 BGP connections (1'000 × 999 / 2) of the naive approach.
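The arithmetic behind that comparison can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function names are mine, and sessions between the route reflectors themselves are ignored.

```python
#!/usr/bin/env python3
"""Sketch: count iBGP sessions for a full mesh versus route reflectors."""


def full_mesh_sessions(routers):
    """Every router peers with every other router; each session counted once."""
    return routers * (routers - 1) // 2


def full_mesh_neighbor_statements(routers):
    """The same mesh, counted as configured neighbors (both ends of each session)."""
    return routers * (routers - 1)


def route_reflector_sessions(routers, reflectors):
    """Every router peers only with each route reflector."""
    return routers * reflectors


if __name__ == "__main__":
    n = 1000
    print(f"full mesh sessions:            {full_mesh_sessions(n):,}")
    print(f"full mesh neighbor statements: {full_mesh_neighbor_statements(n):,}")
    print(f"route reflector sessions:      {route_reflector_sessions(n, 3):,}")
```

Counting each session once gives about half a million for the full mesh, and counting both ends of every session gives about a million. Either way the quadratic growth is the problem, and the route reflector design stays linear.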
 ## Recap: eVPN Moving parts
 The reason I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff is that
 it requires very little dataplane configuration, and a relatively intuitive controlplane
 configuration:
 1.   **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
     need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
     ethernet frames to one or more other speakers in the same domain.
 1.   **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
     implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
     will send me encapsulated ethernet for those addresses.
 1.   **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
     _Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
     packets should be flooded to. I make note that this is not that different from flooding the
     packets to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
 1.   **Controlplane**: Flooding L2 traffic across wide area networks is typically considered icky,
     so a few tricks might be optionally deployed. Since the controlplane already knows which MAC
     lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
     replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
 {{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
 For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
 L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.nic.cz/)] is slowly catching up, and
 has a few of these control plane parts set up (mostly MAC-VRF). Commercial vendors like Arista,
 Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
 changes.
 ## VPP: Changes needed
 ### Dynamic VxLAN
 I propose two changes to the VxLAN plugin, or perhaps, a new plugin that changes the behavior so that
 we don't have to break any performance or functional promises to existing users. This new VxLAN
 interface behavior changes in the following ways:
 1.    Each VxLAN interface has a local L2FIB attached to it: the keys are MAC addresses and the
 values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
 because I can re-use the VNI and port information from the tunnel definition itself.
 1.    Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
 that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
 for flooding, I will need to prepare and replicate them, sending them to each VTEP.
 1.    A set of APIs will be needed to manipulate these:
      *    ***Interface***: I will need to have an interface create, delete and list call, which will
      be able to maintain the interfaces and their metadata like source address, source/destination
      port, VNI and such.
      *    ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
      With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
      dst_addr can be written into the packet.
      *    ***Flooding***: For those packets that are not known unicast (BUM), I will need to be able
      to add, remove and list which VTEPs should receive this packet.
 It would be pretty dope if the configuration looked something like this:
 ```
 vpp0# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
 vpp0# evpn-vxlan l2fib interface <iface> mac <mac> dst <v46address> [del]
 vpp0# evpn-vxlan flood interface <iface> dst <v46address> [del]
 ```
 The VxLAN underlay transport can be either IPv4 or IPv6. Of course, any L2FIB or Flood
 destination must match the address family of the evpn-vxlan interface it is added to. A practical
 example might be:
 ```
 vpp0# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 0
 vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
 vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
 vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::2
 vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::3
 vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::4
 ```
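The semantics I have in mind for these two tables can be modeled in a few lines of Python. This is a toy model under my own assumptions, not VPP code: the class `DynamicVxlan` and its method names are hypothetical, and it simplifies things by treating any MAC that is not in the L2FIB as flood traffic.

```python
#!/usr/bin/env python3
"""Toy model of the proposed per-interface L2FIB and flood list (not VPP code)."""
from dataclasses import dataclass, field
from ipaddress import ip_address


@dataclass
class DynamicVxlan:
    src: str
    vni: int
    dst_port: int = 4789
    l2fib: dict = field(default_factory=dict)  # MAC string -> VTEP address
    flood: list = field(default_factory=list)  # ordered list of VTEP addresses

    def _vtep(self, addr):
        """Parse a VTEP address and insist it matches the tunnel's address family."""
        vtep = ip_address(addr)
        if vtep.version != ip_address(self.src).version:
            raise ValueError(f"{addr} does not match address family of {self.src}")
        return vtep

    def l2fib_add(self, mac, dst):
        """Add or replace the VTEP for a MAC address."""
        self.l2fib[mac.lower()] = self._vtep(dst)

    def l2fib_del(self, mac):
        """Remove a MAC address; a missing entry is not an error."""
        self.l2fib.pop(mac.lower(), None)

    def flood_add(self, dst):
        """Add a VTEP to the flood list, ignoring duplicates."""
        vtep = self._vtep(dst)
        if vtep not in self.flood:
            self.flood.append(vtep)

    def flood_del(self, dst):
        """Remove a VTEP from the flood list if it is present."""
        vtep = self._vtep(dst)
        if vtep in self.flood:
            self.flood.remove(vtep)

    def destinations(self, dst_mac):
        """Known MAC: its single VTEP. Anything else: every VTEP in the flood list."""
        vtep = self.l2fib.get(dst_mac.lower())
        return [vtep] if vtep is not None else list(self.flood)


if __name__ == "__main__":
    tun = DynamicVxlan(src="2001:db8::1", vni=8298)
    tun.l2fib_add("00:01:02:82:98:02", "2001:db8::2")
    tun.l2fib_add("00:01:02:82:98:03", "2001:db8::3")
    for n in (2, 3, 4):
        tun.flood_add(f"2001:db8::{n}")
    print(tun.destinations("00:01:02:82:98:03"))
    print(tun.destinations("ff:ff:ff:ff:ff:ff"))
```

The address family check in `_vtep` is the model's version of the rule that destinations must match the interface: adding `192.0.2.2` to a tunnel sourced from `2001:db8::1` raises a `ValueError`.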
 By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
 plugin. One way in which we might do this is to allow a normal vxlan tunnel to be created with a
 destination address of either 0.0.0.0 for IPv4 or :: for IPv6. That would signal
 'dynamic' tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN
 packet by the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.
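As a rough illustration of that per-packet decision, here is a small Python sketch. It only models the idea: `is_dynamic` and `encap_destination` are hypothetical names, and the standard library's `is_unspecified` property stands in for VPP's `ip46_address_is_zero()`.

```python
#!/usr/bin/env python3
"""Sketch: an all-zeroes tunnel destination selects the dynamic L2FIB lookup."""
from ipaddress import ip_address


def is_dynamic(tunnel_dst):
    """True when the configured destination is 0.0.0.0 or ::."""
    return ip_address(tunnel_dst).is_unspecified


def encap_destination(tunnel_dst, l2fib, dst_mac):
    """Pick the outer destination address for one frame.

    Static tunnels always use their configured destination. Dynamic tunnels
    consult the per-interface L2FIB (a plain dict here); None means no entry.
    """
    if not is_dynamic(tunnel_dst):
        return tunnel_dst
    return l2fib.get(dst_mac)


if __name__ == "__main__":
    l2fib = {"00:01:02:82:98:02": "2001:db8::2"}
    print(encap_destination("2001:db8::254", l2fib, "00:01:02:82:98:02"))
    print(encap_destination("::", l2fib, "00:01:02:82:98:02"))
```

Existing point-to-point tunnels take the first branch and behave exactly as before, which is the point of keying the new behavior off the zero address.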
 ### Bridge Domain
 {{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
 It's important to understand that L2 learning is **required** for eVPN to work: each router needs to
 be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This rules out
 the simple case of L2XC because there, no learning is performed. The corollary is that a
 bridge-domain is required for any form of eVPN.
 The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
 which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
 essentially to a sw_if_index output. The L2FIB of the eVPN needs a bit more information though,
 notably an `ip46_address_t` to know which VTEP to send to. It's tempting to add this extra data
 to the bridge domain code. I would recommend against it, because other implementations, for example
 MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even the VxLAN
 implementation I'm thinking about might want to be able to override other things like the
 destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
 code will just clutter it, for all users, not just those users who might want eVPN.
 Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
 because if it's already replicating BUM traffic, why not replicate it many times over the flood list
 for any member interface that happens to be a dynamic VxLAN interface?  This would be a bad idea
 for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
 doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
 `l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
 above).
 A more reasonable approach is to mark such BUM frames once in the existing L2 code. Then, when the
 replicated packet is handed to the VxLAN node, that node sees the `is_bum` marker and replicates
 the packet once more -- in the vxlan plugin -- to the VTEPs in its local flood-list. Although a
 bit more work overall, this approach only requires a tiny change in the `l2_flood.c` code (the
 marking), and will keep all the logic tucked away where it is relevant, derisking the VPP vnet
 codebase.
 Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
 self-contained, so that it maintains its own L2FIB and Flooding logic. The only thing I
 would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
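The two-stage replication can be sketched in Python as follows. This is a conceptual model only: the function names are mine, the `is_bum` marker is represented as a boolean next to each copy, and dropping a known-unicast frame that has no L2FIB entry is my own assumption.

```python
#!/usr/bin/env python3
"""Sketch: the bridge marks flooded copies, the VxLAN node fans them out to VTEPs."""


def bridge_flood(members, ingress):
    """Stage 1 (L2 code): one copy per member port except the ingress port.

    Each copy carries is_bum=True so that later nodes know it was flooded.
    """
    return [(port, True) for port in members if port != ingress]


def vxlan_output(dst_mac, is_bum, l2fib, flood_list):
    """Stage 2 (VxLAN node): return the VTEPs this frame is encapsulated towards."""
    if is_bum:
        return list(flood_list)  # replicate once more, one copy per VTEP
    vtep = l2fib.get(dst_mac)
    return [vtep] if vtep is not None else []  # assumption: no entry means drop


if __name__ == "__main__":
    l2fib = {"01:02:03:04:05:02": "192.0.2.2"}
    flood_list = ["192.0.2.2", "192.0.2.3"]
    for port, is_bum in bridge_flood(["tap1", "vxlan_tunnel0"], ingress="tap1"):
        print(port, vxlan_output("ff:ff:ff:ff:ff:ff", is_bum, l2fib, flood_list))
```

Note that `bridge_flood` knows nothing about VTEPs and `vxlan_output` knows nothing about bridge members, which is the separation argued for above.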
 ### Control Plane
 There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be
 receiving a few types of eVPN messages. The ones I'm interested in are:
 *   ***Type 2***: MAC/IP Advertisement Route
    -   On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
    -   On the way out, learned addresses should be advertised to peers.
    -   Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
 *   ***Type 3***: Inclusive Multicast Ethernet Tag Route
    -   On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain.
    -   On the way out, each bridge-domain should advertise itself as IMET to peers.
 *   ***Type 5***: IP Prefix Route
    -   Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
        needed.
 The 'on the way in' stuff can be easily done with the proposed APIs in the Dynamic VxLAN plugin.
 Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is concerned. It's just that
 the controlplane implementation needs to somehow _feed_ the API, so an external program may be
 needed, or alternatively the Linux Control Plane netlink plugin might be used to consume this
 information.
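To make the 'on the way in' direction concrete, here is a sketch of what such an external program might do with received routes. Everything here is an assumption: the dict layout of a route is invented for the example, and the emitted lines follow the CLI syntax proposed earlier in this article, which does not exist in VPP today.

```python
#!/usr/bin/env python3
"""Sketch: translate received eVPN routes into the proposed evpn-vxlan CLI."""


def route_to_cli(route, iface):
    """Return one proposed CLI line for a Type 2 or Type 3 route, else None.

    `route` is a plain dict with hypothetical keys: type, vtep, mac (Type 2
    only) and an optional withdraw flag.
    """
    suffix = " del" if route.get("withdraw") else ""
    if route["type"] == 2:  # MAC/IP Advertisement: feeds the L2FIB
        return (
            f"evpn-vxlan l2fib interface {iface} "
            f"mac {route['mac']} dst {route['vtep']}{suffix}"
        )
    if route["type"] == 3:  # Inclusive Multicast Ethernet Tag: feeds the flood list
        return f"evpn-vxlan flood interface {iface} dst {route['vtep']}{suffix}"
    return None  # Type 5 and others are out of scope for now


if __name__ == "__main__":
    routes = [
        {"type": 3, "vtep": "2001:db8::2"},
        {"type": 2, "mac": "00:01:02:82:98:02", "vtep": "2001:db8::2"},
        {"type": 2, "mac": "00:01:02:82:98:02", "vtep": "2001:db8::2", "withdraw": True},
    ]
    for route in routes:
        print(route_to_cli(route, "evpn-vxlan0"))
```

A real agent would call the binary API rather than format CLI strings, but the mapping from route type to table would be the same.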
 The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
 domains and associate them with the right IMET announcements, and for each MAC address learned, pick
 it up and advertise it into eVPN. Later, once ARP and ND proxying is a thing, I'll have to
 revisit the bridge-domain feature that handles IPv4 ARP and IPv6 Neighbor Discovery, and replace it
 with some code that populates the IPv4/IPv6 parts of the Type 2 messages on the way out. Similarly,
 on the way in, it would populate an L3 neighbor cache for the bridge domain, so ARP and ND replies
 can be synthesized based on what we've learned in eVPN.
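The first half of the 'on the way out' direction, without the ARP/ND parts, might look like the sketch below. The event names (`bd_add`, `mac_add` and so on) and the dict fields are hypothetical; VPP does not emit events in this shape, and how a bridge-domain maps to a VNI is assumed to be known to the agent.

```python
#!/usr/bin/env python3
"""Sketch: turn local dataplane events into eVPN routes to advertise."""


def local_event_to_route(event, local_vtep):
    """Map one local event (a plain dict) to a route dict, or None if irrelevant."""
    kind = event["kind"]
    withdraw = kind.endswith("_del")
    if kind in ("bd_add", "bd_del"):
        # A broadcast domain appearing or disappearing: Type 3 (IMET)
        return {"type": 3, "vni": event["vni"], "vtep": local_vtep, "withdraw": withdraw}
    if kind in ("mac_add", "mac_del"):
        # A MAC learned or aged out locally: Type 2 (MAC/IP Advertisement)
        return {
            "type": 2,
            "vni": event["vni"],
            "mac": event["mac"],
            "vtep": local_vtep,
            "withdraw": withdraw,
        }
    return None


if __name__ == "__main__":
    events = [
        {"kind": "bd_add", "vni": 8298},
        {"kind": "mac_add", "vni": 8298, "mac": "02:fe:64:dc:1b:82"},
        {"kind": "mac_del", "vni": 8298, "mac": "02:fe:64:dc:1b:82"},
    ]
    for event in events:
        print(local_event_to_route(event, "192.0.2.1"))
```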
 # Demonstration
 ### VPP: Current VxLAN
 I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
 Domain works today:
 ```
 create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
 set int state tap0 up
 set int ip address tap0 192.0.2.1/24
 set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
 set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
 set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
 create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
 set int state vxlan_tunnel0 up
 create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
 set int state tap1 up
 create bridge-domain 8298
 set int l2 bridge tap1 8298
 set int l2 bridge vxlan_tunnel0 8298
 ```
 I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some
 DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
 Again, normally this would just be 'use normal routing'. However, for the purposes of this
 demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and
 friends can be captured with `tcpdump` in Linux in addition to `trace add` in VPP.
 Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
 Next, I create a TAP interface called `vpptap0` with the given MAC address.
 Finally, I bind these two interfaces together in a bridge-domain.
 I proceed to write a small Scapy program:
 ```python
 #!/usr/bin/env python3
 """Send two UDP echo frames with different destination MACs into vpptap0."""
 from scapy.all import Ether, IP, UDP, Raw, sendp
 SRC_MAC = "02:fe:64:dc:1b:82"  # matches the hw-addr given to vpptap0
 for n in (2, 3):
     # Parentheses let the layered packet expression span several lines
     pkt = (
         Ether(dst=f"01:02:03:04:05:{n:02x}", src=SRC_MAC, type=0x0800)
         / IP(src="192.168.1.1", dst=f"192.168.1.{n}")
         / UDP(sport=8298, dport=7)
         / Raw(load=b"ping")
     )
     # summary() gives the one-line description regardless of Scapy version
     print(pkt.summary())
     sendp(pkt, iface="vpptap0")
 ```
 What will happen is, the Scapy program will emit these frames into device `vpptap0` which is in
 bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
 `01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
 of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
 tunnel.
 ```
 pim@summer:~$ sudo ./vxlan-test.py 
 Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
 Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
 pim@summer:~$ sudo tcpdump -evni dummy0
 10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
    (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
    192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
      02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
      (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
        192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
 10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
    (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
    192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
      02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
      (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
        192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
 ```
 I want to point out that nothing, so far, has anything to do with my gerrit change, all of this
 works with upstream VPP just fine. I can see two VxLAN encapsulated packets, both destined to
 `192.0.2.254:4789`. Cool.
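The lengths and flags in that capture can be reproduced by hand. The sketch below packs the 8-byte VXLAN header as laid out in RFC 7348 and adds up the header sizes; the function and constant names are mine.

```python
#!/usr/bin/env python3
"""Sketch: build the VXLAN header for VNI 8298 and check the captured lengths."""
import struct

ETH_LEN, IPV4_LEN, UDP_LEN, VXLAN_LEN = 14, 20, 8, 8


def vxlan_header(vni):
    """Flags byte 0x08 (the I bit), 24 reserved bits, 24-bit VNI, 8 reserved bits."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", 0x08 << 24, vni << 8)


if __name__ == "__main__":
    inner_frame = ETH_LEN + IPV4_LEN + UDP_LEN + len(b"ping")
    outer_ip = IPV4_LEN + UDP_LEN + VXLAN_LEN + inner_frame
    outer_frame = ETH_LEN + outer_ip
    print(vxlan_header(8298).hex())
    print(inner_frame, outer_ip, outer_frame)
```

The three sums match what tcpdump reports: an inner frame of 46 bytes, an outer IPv4 length of 82 and an outer frame of 96 bytes, so the encapsulation adds 50 bytes per packet on an IPv4 underlay.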
 ### Dynamic VPP VxLAN
 I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
 The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
 article) with the community before going further down a potential rabbit hole.
 With my gerrit patched in, I can do the following:
 ```
 vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2       
 Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
 vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
 Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
 vpp# show vxlan l2fib 
 VXLAN Dynamic L2FIB entries:
        MAC            Interface      Destination     Port      VNI  
 01:02:03:04:05:02   vxlan_tunnel0     192.0.2.2      4789     8298  
 01:02:03:04:05:03   vxlan_tunnel0     192.0.2.3      4789     8298  
 Dynamic L2FIB entries: 2
 ```
 I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
 I run the script and tcpdump again:
 ```
 pim@summer:~$ sudo tcpdump -evni dummy0
 11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
    (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
    192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
      02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
      (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
        192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
 11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
    (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
    192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
      02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
      (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
        192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
 ```
 Two important notes: Firstly, this works! The packet for the MAC address ending in `:02` is sent to
 `192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to
 `192.0.2.3`. Nice! But secondly, tcpdump flags a `bad cksum` on both packets: the destination in
 the IPv4 header of the VxLAN packets was changed without updating the checksum, so there needs to
 be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.
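To double check what tcpdump expects, the header checksum can be recomputed from the fields in the capture. This is a plain RFC 1071 ones' complement sum written in Python for illustration, not the VPP function; `ipv4_checksum` and `outer_header` are names I made up, and the default field values are copied from the tcpdump output above.

```python
#!/usr/bin/env python3
"""Sketch: recompute the outer IPv4 header checksum for the rewritten packets."""
import struct
from ipaddress import IPv4Address


def ipv4_checksum(header):
    """Ones' complement of the ones' complement sum of all 16-bit words.

    The checksum field inside `header` must be zero when calling this.
    """
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def outer_header(dst, src="192.0.2.1", ttl=253, length=82):
    """A 20-byte IPv4 header: tos 0, id 0, no flags, proto UDP, checksum zeroed."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, length, 0, 0, ttl, 17, 0,
        IPv4Address(src).packed, IPv4Address(dst).packed,
    )


if __name__ == "__main__":
    for dst in ("192.0.2.2", "192.0.2.3"):
        print(dst, f"{ipv4_checksum(outer_header(dst)):04x}")
```

For `192.0.2.2` and `192.0.2.3` this yields `3997` and `3996`, the same values tcpdump shows after the `->` in its `bad cksum` complaint.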
 # What's next
 I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
 1.   Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
     write it with me?
 1.   Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
     dynamic endpoints, L2FIB and Flood lists for BUM traffic?
 1.   Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
     from bridge-domain flooding as I extend to also do VTEP flooding?
 1.   (perhaps later) VxLAN is the canonical encapsulation, but is there an appetite to extend also
     to, say, GENEVE or MPLS?
 1.   (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
     dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?