Pim van Pelt 5d02b6466c
2025-07-12 11:48:56 +02:00


---
date: "2025-07-12T08:07:23Z"
title: 'VPP and eVPN/VxLAN - Part 1'
---
{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# Introduction
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
level of awesome.
## Recap: VPP today
### VPP: VxLAN
The current VPP VxLAN tunnel plugin creates point-to-point tunnels: each one is configured with a
source address, destination address, destination port and VNI. As I mentioned, a point-to-point
ethernet transport is configured very easily:
```
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
vpp0# set int state vxlan_tunnel0 up
vpp0# set int state HundredGigabitEthernet10/0/0 up
vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
vpp1# set int state vxlan_tunnel0 up
vpp1# set int state HundredGigabitEthernet10/0/1 up
```
And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
the two.
### VPP: Bridge Domains
Now consider a VPLS with five different routers. It's possible to create a bridge-domain and add
some local ports and four other VxLAN tunnels:
```
vpp0# create bridge-domain 8298
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
vpp0# set int l2 bridge vxlan_tunnel0 8298
vpp0# set int l2 bridge vxlan_tunnel1 8298
vpp0# set int l2 bridge vxlan_tunnel2 8298
vpp0# set int l2 bridge vxlan_tunnel3 8298
```
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
every other router will have to have a new VxLAN tunnel created and added to its local bridge
domain. Consider 1000s of VPLS instances on 100s of routers: that would yield ~100'000 VxLAN tunnels
on every router, yikes!
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
connects, and their job is to redistribute routing information between the fleet of routers. This
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~500'000
BGP connections in the naive approach.
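
To put numbers on both scaling arguments, here is a back-of-the-envelope sketch in Python. The function names are mine and the inputs are the round figures used above (1'000 VPLS instances on 100 routers, and 1'000 iBGP speakers with three route reflectors), so treat it as an illustration rather than a sizing tool.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope scaling numbers; all function names are illustrative."""


def vxlan_tunnels_per_router(vpls_instances: int, routers: int) -> int:
    # Each VPLS needs one point-to-point tunnel to every *other* router.
    return vpls_instances * (routers - 1)


def full_mesh_sessions(routers: int) -> int:
    # Every pair of routers shares one iBGP session: N * (N - 1) / 2.
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int) -> int:
    # Every router peers with every reflector; reflector-to-reflector
    # sessions are ignored to keep the comparison simple.
    return routers * reflectors


if __name__ == "__main__":
    print(vxlan_tunnels_per_router(1000, 100))   # 99000
    print(full_mesh_sessions(1000))              # 499500
    print(route_reflector_sessions(1000, 3))     # 3000
```

Counted once per pair of routers, the full mesh is about half a million sessions. Counted as neighbor statements, with each session configured on both ends, it is about a million. Either way it grows quadratically, while the route reflector design grows linearly.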
## Recap: eVPN Moving parts
The reason why I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff, is
because it requires very little dataplane configuration, and a relatively intuitive controlplane
configuration:
1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
ethernet frames to one or more other speakers in the same domain.
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
will send me encapsulated ethernet for those addresses.
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
_Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
packets should be flooded to. I make note that this is not that different from flooding the packets
to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
1. **Controlplane**: Flooding L2 traffic across wide area networks is typically considered icky,
so a few tricks might be optionally deployed. Since the controlplane already knows which MAC
lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.nic.cz/)], is slowly catching up, and
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
changes.
## VPP: Changes needed
### Dynamic VxLAN
I propose two changes to the VxLAN plugin, or perhaps, a new plugin that changes the behavior so that
we don't have to break any performance or functional promises to existing users. This new VxLAN
interface behavior changes in the following ways:
1. Each VxLAN interface has a local L2FIB attached to it, the keys are MAC addresses and the
values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
because I can re-use the VNI and port information from the tunnel definition itself.
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
A set of APIs will be needed to manipulate these:
* ***Interface***: I will need to have an interface create, delete and list call, which will
be able to maintain the interfaces, their metadata like source address, source/destination port,
VNI and such.
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
dst_addr can be written into the packet.
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
remove and list which VTEPs should receive this packet.
It would be pretty dope if the configuration looked something like this:
```
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
```
The VxLAN underlay transport can be either IPv4 or IPv6. Of course, any L2FIB or Flood
destination must match the address family of the evpn-vxlan interface it is added to. A practical
example might be:
```
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 0
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
```
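
The two tables and the address family rule can be modelled in a few lines of Python. This is a toy model of the proposal, not VPP code: the class and method names are hypothetical, and the addresses are the ones from the example above.

```python
#!/usr/bin/env python3
"""Toy model of the proposed dynamic VxLAN tables; all names are hypothetical."""
import ipaddress
from dataclasses import dataclass, field


@dataclass
class EvpnVxlan:
    src: str
    vni: int
    dst_port: int = 4789
    l2fib: dict = field(default_factory=dict)   # MAC -> remote VTEP address
    flood: list = field(default_factory=list)   # remote VTEPs for BUM traffic

    def _check_family(self, dst: str) -> str:
        # The underlay is either IPv4 or IPv6; destinations must match the source.
        if ipaddress.ip_address(dst).version != ipaddress.ip_address(self.src).version:
            raise ValueError(f"{dst} does not match the address family of {self.src}")
        return dst

    def l2fib_add(self, mac: str, dst: str) -> None:
        self.l2fib[mac.lower()] = self._check_family(dst)   # add or replace

    def l2fib_del(self, mac: str) -> None:
        self.l2fib.pop(mac.lower(), None)

    def flood_add(self, dst: str) -> None:
        if self._check_family(dst) not in self.flood:
            self.flood.append(dst)

    def flood_del(self, dst: str) -> None:
        if dst in self.flood:
            self.flood.remove(dst)

    def destinations(self, dst_mac: str) -> list:
        # Known unicast goes to exactly one VTEP; anything else uses the flood list.
        vtep = self.l2fib.get(dst_mac.lower())
        return [vtep] if vtep else list(self.flood)


if __name__ == "__main__":
    tun = EvpnVxlan(src="2001:db8::1", vni=8298)
    tun.l2fib_add("00:01:02:82:98:02", "2001:db8::2")
    tun.l2fib_add("00:01:02:82:98:03", "2001:db8::3")
    for vtep in ("2001:db8::2", "2001:db8::3", "2001:db8::4"):
        tun.flood_add(vtep)
    print(tun.destinations("00:01:02:82:98:03"))   # one VTEP
    print(tun.destinations("ff:ff:ff:ff:ff:ff"))   # the whole flood list
```

Keeping the VNI and port on the interface and only the VTEP address in the table is the "simplest form" described earlier; per-VTEP overrides would turn the dictionary values into small records.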
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
the time it takes to call `ip46_address_is_zero()` which is only a handful of clocks.
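
As a rough illustration of that dispatch, here is a Python sketch in which an unspecified tunnel destination switches the tunnel to the dynamic tables. The function names are hypothetical and the real check would live in the VPP C encap path; Python's `ipaddress` module merely stands in for `ip46_address_is_zero()`.

```python
#!/usr/bin/env python3
"""Sketch of the 'unspecified destination means dynamic' idea; not VPP code."""
import ipaddress


def is_dynamic(tunnel_dst: str) -> bool:
    # 0.0.0.0 (IPv4) and :: (IPv6) are the unspecified addresses.
    return ipaddress.ip_address(tunnel_dst).is_unspecified


def encap_destinations(tunnel_dst: str, dst_mac: str, l2fib: dict, flood: list) -> list:
    if not is_dynamic(tunnel_dst):
        return [tunnel_dst]                  # classic point-to-point tunnel
    vtep = l2fib.get(dst_mac)
    return [vtep] if vtep else list(flood)   # known unicast, else the flood list


if __name__ == "__main__":
    l2fib = {"00:01:02:82:98:02": "2001:db8::2"}
    flood = ["2001:db8::2", "2001:db8::3"]
    print(encap_destinations("192.0.2.254", "00:01:02:82:98:02", {}, []))
    print(encap_destinations("::", "00:01:02:82:98:02", l2fib, flood))
    print(encap_destinations("::", "00:01:02:82:98:99", l2fib, flood))
```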
### Bridge Domain
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
It's important to understand that L2 learning is **required** for eVPN to function. Each router
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
bridge-domain is required for any form of eVPN.
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
though, notably an `ip46_address_t` to know which VTEP to send to. It's tempting to add this
extra data to the bridge domain code. I would recommend against it, because other implementations,
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
the VxLAN implementation I'm thinking about might want to be able to override other things like the
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
code will just clutter it, for all users, not just those users who might want eVPN.
Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
because if it's already replicating BUM traffic, why not replicate it many times over the flood list
for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea
for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
`l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
above).
A reasonable alternative is to mark such BUM frames once in the existing L2 code. When the
replicated packet is handed to the VxLAN node, that node sees the `is_bum` marker and replicates the
packet once more -- in the vxlan plugin -- to the VTEPs in its local flood-list. Although a bit more
work overall, this approach only requires a tiny change in the `l2_flood.c` code (the marking), and
will keep all the logic tucked away where it is relevant, derisking the VPP vnet codebase.
Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
self-contained and it would therefore maintain its own L2FIB and Flooding logic. The only thing I
would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
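
The two-stage replication can be sketched as follows. This is a simplified Python model with hypothetical names (`Frame`, `bridge_flood`, `vxlan_output`); in VPP the marker would more likely be a flag in the buffer metadata, which is an assumption on my part.

```python
#!/usr/bin/env python3
"""Two-stage BUM replication, modelled in Python; all names are hypothetical."""
from dataclasses import dataclass


@dataclass
class Frame:
    dst_mac: str
    is_bum: bool = False   # marker set once by the bridge flood path


def bridge_flood(frame: Frame, members: list, ingress: str) -> list:
    # Stage 1 (L2 code): one marked copy per member port, except the ingress port.
    return [(port, Frame(frame.dst_mac, is_bum=True)) for port in members if port != ingress]


def vxlan_output(frame: Frame, l2fib: dict, flood_list: list) -> list:
    # Stage 2 (VxLAN node): marked frames are replicated to every VTEP on the
    # flood list, unmarked frames are looked up in the interface's own L2FIB.
    if frame.is_bum:
        return list(flood_list)
    vtep = l2fib.get(frame.dst_mac)
    return [vtep] if vtep else []


if __name__ == "__main__":
    copies = bridge_flood(Frame("ff:ff:ff:ff:ff:ff"), ["tap1", "evpn-vxlan0"], ingress="tap1")
    for port, copy in copies:
        print(port, vxlan_output(copy, {}, ["2001:db8::2", "2001:db8::3"]))
```

The bridge code only sets one bit and knows nothing about VTEPs, which is the separation argued for above.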
### Control Plane
There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be
receiving a few types of eVPN messages. The ones I'm interested in are:
* ***Type 2***: MAC/IP Advertisement Route
- On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
- On the way out, learned addresses should be advertised to peers.
- Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
- On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain.
- On the way out, each bridge-domain should advertise itself as IMET to peers.
* ***Type 5***: IP Prefix Route
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
needed.
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
to consume this information.
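
A sidecar that feeds the API could start out as little more than a translation table. The sketch below turns route events into the CLI proposed earlier. Everything here is hypothetical: the route dictionaries are an invented format, and the `evpn-vxlan` commands exist only as a proposal in this article.

```python
#!/usr/bin/env python3
"""Translate eVPN route events into the proposed CLI; purely illustrative."""


def type2_to_cli(iface: str, mac: str, vtep: str, withdraw: bool = False) -> str:
    # Type 2 (MAC/IP Advertisement): program or remove an L2FIB entry.
    cmd = f"evpn-vxlan l2fib {iface} mac {mac} dst {vtep}"
    return f"{cmd} del" if withdraw else cmd


def type3_to_cli(iface: str, vtep: str, withdraw: bool = False) -> str:
    # Type 3 (IMET): add or remove a VTEP on the flood list.
    cmd = f"evpn-vxlan flood {iface} dst {vtep}"
    return f"{cmd} del" if withdraw else cmd


def route_to_cli(route: dict) -> str:
    withdraw = route.get("withdraw", False)
    if route["type"] == 2:
        return type2_to_cli(route["iface"], route["mac"], route["vtep"], withdraw)
    if route["type"] == 3:
        return type3_to_cli(route["iface"], route["vtep"], withdraw)
    raise ValueError(f"unhandled eVPN route type {route['type']}")


if __name__ == "__main__":
    events = [
        {"type": 3, "iface": "evpn-vxlan0", "vtep": "2001:db8::2"},
        {"type": 2, "iface": "evpn-vxlan0", "mac": "00:01:02:82:98:02", "vtep": "2001:db8::2"},
        {"type": 2, "iface": "evpn-vxlan0", "mac": "00:01:02:82:98:02", "vtep": "2001:db8::2",
         "withdraw": True},
    ]
    for event in events:
        print(route_to_cli(event))
```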
The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
them up and advertise them into eVPN. Later, if ever ARP and ND proxying becomes important, I'll
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
can be synthesized based on what we've learned in eVPN.
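
For the MAC part of the 'on the way out' direction, one simple approach is to compare two snapshots of the locally learned addresses and turn the difference into advertisements and withdrawals. The sketch below assumes the agent can obtain such snapshots as sets of `(bridge-domain, MAC)` tuples; how it gets them (polling or events) is left open, and the function name is hypothetical.

```python
#!/usr/bin/env python3
"""Diff two snapshots of locally learned MACs; names and format are hypothetical."""


def diff_learned(previous: set, current: set) -> tuple:
    # Newly learned entries become Type 2 advertisements, aged-out entries
    # become withdrawals. Entries present in both snapshots need no action.
    advertise = sorted(current - previous)
    withdraw = sorted(previous - current)
    return advertise, withdraw


if __name__ == "__main__":
    before = {(8298, "02:fe:64:dc:1b:82"), (8298, "02:fe:64:dc:1b:83")}
    after = {(8298, "02:fe:64:dc:1b:82"), (8298, "02:fe:64:dc:1b:84")}
    advertise, withdraw = diff_learned(before, after)
    print("advertise:", advertise)
    print("withdraw: ", withdraw)
```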
# Demonstration
### VPP: Current VxLAN
I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
Domain works today:
```
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
vpp# set int state tap0 up
vpp# set int ip address tap0 192.0.2.1/24
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
vpp# set int state vxlan_tunnel0 up
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
vpp# set int state tap1 up
vpp# create bridge-domain 8298
vpp# set int l2 bridge tap1 8298
vpp# set int l2 bridge vxlan_tunnel0 8298
```
I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some
DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
Again, normally this would just be 'use normal routing'. However, for the purposes of this
demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and
its neighbors can be captured with `tcpdump` in Linux in addition to `trace add` in VPP.
Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
Next, I create a TAP interface called `vpptap0` with the given MAC address.
Finally, I bind these two interfaces together in a bridge-domain.
I proceed to write a small Scapy program:
```python
#!/usr/bin/env python3
from scapy.all import Ether, IP, UDP, Raw, sendp

# Parentheses let each '/' layer-stacking expression span multiple lines.
pkt = (Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800)
       / IP(src="192.168.1.1", dst="192.168.1.2")
       / UDP(sport=8298, dport=7) / Raw(load=b"ping"))
print(pkt)
sendp(pkt, iface="vpptap0")  # emit the frame into the bridge via vpptap0

# Same frame, but towards the second destination MAC and IPv4 address.
pkt = (Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800)
       / IP(src="192.168.1.1", dst="192.168.1.3")
       / UDP(sport=8298, dport=7) / Raw(load=b"ping"))
print(pkt)
sendp(pkt, iface="vpptap0")
```
What will happen is, the Scapy program will emit these frames into device `vpptap0` which is in
bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
`01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
tunnel.
```
pim@summer:~$ sudo ./vxlan-test.py
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
pim@summer:~$ sudo tcpdump -evni dummy0
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
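
The numbers in the capture can be checked by hand. The Python sketch below builds the 8-byte VxLAN header as laid out in RFC 7348 and adds up the header sizes. The function name is mine, and the sizes assume plain IPv4 without options and untagged Ethernet, as in this demo.

```python
#!/usr/bin/env python3
"""VxLAN header layout and the length arithmetic behind the tcpdump output."""
import struct


def vxlan_header(vni: int) -> bytes:
    # RFC 7348: 8 flag bits (0x08 means 'VNI present'), 24 reserved bits,
    # a 24-bit VNI and 8 more reserved bits.
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", 0x08 << 24, vni << 8)


ETH, IP4, UDP_HDR = 14, 20, 8
INNER_FRAME = ETH + IP4 + UDP_HDR + len(b"ping")                      # 46
OUTER_IP_LEN = IP4 + UDP_HDR + len(vxlan_header(8298)) + INNER_FRAME  # 82
OUTER_FRAME = ETH + OUTER_IP_LEN                                      # 96

if __name__ == "__main__":
    print(vxlan_header(8298).hex())                # 0800000000206a00
    print(INNER_FRAME, OUTER_IP_LEN, OUTER_FRAME)  # 46 82 96
```

The three values match the `length 46`, `length 82` and `length 96` that tcpdump prints for each packet: the encapsulation adds 50 bytes per frame on an IPv4 underlay.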
### Dynamic VPP VxLAN
I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
article) with the community before going further down a potential rabbit hole.
With my gerrit patched in, I can do the following:
```
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
vpp# show vxlan l2fib
VXLAN Dynamic L2FIB entries:
MAC Interface Destination Port VNI
01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298
01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298
Dynamic L2FIB entries: 2
```
I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
I run the script and tcpdump again:
```
pim@summer:~$ sudo tcpdump -evni dummy0
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```
Two important notes: Firstly, this works! For the MAC address ending in `:02`, the packet is sent to
`192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to
`192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to
be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.
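
The values tcpdump expects (`->3997` and `->3996`) can be reproduced with a plain RFC 1071 checksum over the outer IPv4 header. The sketch below is standalone Python, not VPP's `ip4_header_checksum()`, and the header fields are taken from the capture above (tos 0, length 82, id 0, no flags, ttl 253, proto UDP).

```python
#!/usr/bin/env python3
"""Recompute the outer IPv4 header checksum for the rewritten destinations."""
import socket
import struct


def ip4_checksum(header: bytes) -> int:
    # RFC 1071: one's complement of the one's complement sum of all 16-bit words.
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def outer_header(dst: str) -> bytes:
    # version/IHL, tos, total length, id, flags/offset, ttl, proto, checksum (0), src, dst
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 82, 0, 0, 253, 17, 0,
                       socket.inet_aton("192.0.2.1"), socket.inet_aton(dst))


if __name__ == "__main__":
    for dst in ("192.0.2.2", "192.0.2.3"):
        print(dst, f"{ip4_checksum(outer_header(dst)):04x}")
    # 192.0.2.2 3997
    # 192.0.2.3 3996
```

Since only the destination address changes, an incremental update of the existing checksum would also work and avoids summing the whole header again; which of the two is the better fit for the encap node is something I have not measured.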
# What's next
I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
write it with me?
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
dynamic endpoints, L2FIB and Flood lists for BUM traffic?
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
from bridge-domain flooding as I extend to also do VTEP flooding?
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to,
say, GENEVE or MPLS?
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?