Typo and formatting fixes
All checks were successful
continuous-integration/drone/push Build is passing

2025-07-12 11:38:35 +02:00
parent 85b41ba4e0
commit b6b419471d


@ -10,12 +10,12 @@ title: 'VPP and eVPN/VxLAN - Part 1'
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
it [[here]({<< ref 2025-04-09-frysix-evpn >>})] back in April.
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
look at an article on [[L2 Gymnastics]({<< ref 2022-01-12-vpp-l2 >>})] for that. But the real kicker
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
level of awesome.
@ -63,15 +63,16 @@ vpp0# set int l2 bridge vxlan_tunnel2 8298
vpp0# set int l2 bridge vxlan_tunnel3 8298
```
I have to replicate this configuration to all other `vpp1`-`vpp4` routers. While it does work, it's
really not very practical. When other VPP instances get added to a VPLS, every other router will
have to have a new VxLAN tunnel created and added to its local bridge domain. Consider 1000s of VPLS
instances on 100s of routers, it would yield ~100'000 VxLAN tunnels on every router, yikes!
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
every other router will have to have a new VxLAN tunnel created and added to its local bridge
domain. Consider 1000s of VPLS instances on 100s of routers, it would yield ~100'000 VxLAN tunnels
on every router, yikes!
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
connects, and their job is to redistribute routing information between the fleet of routers. THis
connects, and their job is to redistribute routing information between the fleet of routers. This
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
BGP connections in the naive approach.
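
The numbers above can be sanity-checked with a short sketch. The function names are made up for illustration. Note that the ~1'000'000 figure counts both ends of every session; the number of unique full-mesh sessions is about half of that. The tunnel count assumes every VPLS instance spans every router.

```python
def full_mesh_sessions(routers: int) -> int:
    """Unique iBGP sessions when every router peers with every other router."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int) -> int:
    """Sessions when every router peers only with each route reflector."""
    return routers * reflectors


def tunnels_per_router(vpls_instances: int, routers: int) -> int:
    """Static VxLAN tunnels on one router if every VPLS spans all routers."""
    return vpls_instances * (routers - 1)


if __name__ == "__main__":
    n = 1000
    print("full mesh, unique sessions:  ", full_mesh_sessions(n))
    print("full mesh, session endpoints:", 2 * full_mesh_sessions(n))
    print("three route reflectors:      ", route_reflector_sessions(n, 3))
    print("tunnels per router:          ", tunnels_per_router(1000, 100))
```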
@ -100,8 +101,8 @@ configuration:
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://birg.nic.cz/)] is slowly catching up, and
has a few of these control plane parts set up (mostly MAC-VRF). Commercial vendors like Arista,
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://birg.nic.cz/)], is slowly catching up, and
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
changes.
@ -122,21 +123,21 @@ that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, whe
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
1. A set of APIs will be needed to manipulate these:
* ***Interface***: I will need to have an interface create, delete and list call, which will
be able to maintain the interfaces, their metadata like source address, source/destination port,
VNI and such.
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
dst_addr can be written into the packet.
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
remove and list which VTEPs should receive this packet.
A set of APIs will be needed to manipulate these:
* ***Interface***: I will need to have an interface create, delete and list call, which will
be able to maintain the interfaces, their metadata like source address, source/destination port,
VNI and such.
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
dst_addr can be written into the packet.
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
remove and list which VTEPs should receive this packet.
It would be pretty dope if the configuration looked something like this:
```
vpp0# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
vpp0# evpn-vxlan l2fib interface <iface> mac <mac> dst <v46address> [del]
vpp0# evpn-vxlan flood interface <iface> dst <v46address> [del]
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
```
The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
@ -144,36 +145,36 @@ destinations must match the address family of an interface of type evpn-vxlan. A
might be:
```
vpp0# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::2
vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::3
vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::4
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
```
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
plugin. One way in which we might do this is to allow for the creation of a normal vxlan tunnel to
allow for its destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal
'dynamic' tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN
packet by the time it takes to call `ip46_address_is_zero()` which is only a handful of clocks.
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
the time it takes to call `ip46_address_is_zero()` which is only a handful of clocks.
### Bridge Domain
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
It's important to understand that L2 learning is **required** for eVPN to work: each router needs to
be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This rules out
the simple case of L2XC because there, no learning is performed. The corollary is that a
It's important to understand that L2 learning is **required** for eVPN to function. Each router
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
bridge-domain is required for any form of eVPN.
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
essentially to a sw_if_index output. The L2FIB of the eVPN needs a bit more information though,
notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this extra data
to the bridge domain code. I would recommend against it, because other implementations, for example
MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even the VxLAN
implementation I'm thinking about might want to be able to override other things like the
essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
though, notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this
extra data to the bridge domain code. I would recommend against it, because other implementations,
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
the VxLAN implementation I'm thinking about might want to be able to override other things like the
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
code will just clutter it, for all users, not just those users who might want eVPN.
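
To make the idea of keeping VTEP state out of the bridge-domain a bit more concrete, here is a minimal sketch in Python of what a per-interface table could look like. It is a toy model and not VPP code: the class and method names are hypothetical, and it only mirrors the L2FIB and Flood list semantics described above, including the rule that destinations must match the address family of the interface.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address, IPv6Address, ip_address
from typing import Dict, List, Union

Address = Union[IPv4Address, IPv6Address]


@dataclass
class EvpnVxlanInterface:
    """Hypothetical per-interface state, owned by the tunnel code rather
    than by the bridge-domain."""
    src: Address
    vni: int
    dst_port: int = 4789
    l2fib: Dict[str, Address] = field(default_factory=dict)  # MAC -> VTEP
    flood: List[Address] = field(default_factory=list)       # BUM targets

    def _check_family(self, dst: Address) -> None:
        # L2FIB and Flood destinations must match the interface's family.
        if dst.version != self.src.version:
            raise ValueError("VTEP address family must match the interface")

    def l2fib_add(self, mac: str, dst: str) -> None:
        vtep = ip_address(dst)
        self._check_family(vtep)
        self.l2fib[mac.lower()] = vtep  # adding an existing MAC replaces it

    def l2fib_del(self, mac: str) -> None:
        self.l2fib.pop(mac.lower(), None)

    def flood_add(self, dst: str) -> None:
        vtep = ip_address(dst)
        self._check_family(vtep)
        if vtep not in self.flood:
            self.flood.append(vtep)

    def flood_del(self, dst: str) -> None:
        vtep = ip_address(dst)
        if vtep in self.flood:
            self.flood.remove(vtep)

    def destinations(self, dst_mac: str) -> List[Address]:
        """Known unicast goes to one VTEP, everything else is replicated."""
        vtep = self.l2fib.get(dst_mac.lower())
        return [vtep] if vtep is not None else list(self.flood)


if __name__ == "__main__":
    iface = EvpnVxlanInterface(src=ip_address("2001:db8::1"), vni=8298)
    iface.l2fib_add("00:01:02:82:98:02", "2001:db8::2")
    for dst in ("2001:db8::2", "2001:db8::3", "2001:db8::4"):
        iface.flood_add(dst)
    print(iface.destinations("00:01:02:82:98:02"))  # single VTEP
    print(iface.destinations("ff:ff:ff:ff:ff:ff"))  # all flood VTEPs
```

Because the table hangs off the interface, other encapsulations could carry their own, richer per-destination data without touching the bridge-domain's L2FIB.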
@ -211,19 +212,19 @@ receiving a few types of eVPN messages. The ones I'm interested in are:
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
needed.
The 'on the way in' stuff can be easily done with the proposed APIs in the Dynamic VxLAN plugin.
Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is concerned. It's just that
the controlplane implementation needs to somehow _feed_ the API, so an external program may be
needed, or alternatively the Linux Control Plane netlink plugin might be used to consume this
information.
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
to consume this information.
The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
them up and advertise them into eVPN. Later, once ARP and ND proxying is a thing, I'll have to
revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it with
some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and similarly on
the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies can be
synthesized based on what we've learned in eVPN.
them up and advertise them into eVPN. Later, if ARP and ND proxying ever becomes important, I'll
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
can be synthesized based on what we've learned in eVPN.
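
As a rough illustration of the 'on the way in' direction, an external program could translate received eVPN routes into the proposed commands. The sketch below is hypothetical: the `cli_for_route()` helper is invented for this example, it assumes Type 2 routes carry a MAC and Type 3 (IMET) routes carry only the VTEP, and the command syntax is the proposal from this article, not something upstream VPP understands.

```python
from typing import Optional


def cli_for_route(route_type: int, iface: str, vtep: str,
                  mac: Optional[str] = None, withdraw: bool = False) -> str:
    """Map one eVPN route update onto a proposed (hypothetical) CLI line."""
    suffix = " del" if withdraw else ""
    if route_type == 2:
        # Type 2 (MAC/IP advertisement): unicast entry in the L2FIB.
        if mac is None:
            raise ValueError("Type 2 routes need a MAC address")
        return f"evpn-vxlan l2fib {iface} mac {mac} dst {vtep}{suffix}"
    if route_type == 3:
        # Type 3 (IMET): the VTEP wants to receive flooded (BUM) traffic.
        return f"evpn-vxlan flood {iface} dst {vtep}{suffix}"
    raise ValueError(f"unhandled eVPN route type {route_type}")


if __name__ == "__main__":
    print(cli_for_route(3, "evpn-vxlan0", "2001:db8::2"))
    print(cli_for_route(2, "evpn-vxlan0", "2001:db8::2",
                        mac="00:01:02:82:98:02"))
    print(cli_for_route(2, "evpn-vxlan0", "2001:db8::2",
                        mac="00:01:02:82:98:02", withdraw=True))
```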
# Demonstration
@ -233,22 +234,22 @@ I'll build a small demo environment on Summer to show how the interaction of VxL
Domain works today:
```
create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
set int state tap0 up
set int ip address tap0 192.0.2.1/24
set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
vpp# set int state tap0 up
vpp# set int ip address tap0 192.0.2.1/24
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
set int state vxlan_tunnel0 up
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
vpp# set int state vxlan_tunnel0 up
create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
set int state tap1 up
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
vpp# set int state tap1 up
create bridge-domain 8298
set int l2 bridge tap1 8298
set int l2 bridge vxlan_tunnel0 8298
vpp# create bridge-domain 8298
vpp# set int l2 bridge tap1 8298
vpp# set int l2 bridge vxlan_tunnel0 8298
```
I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some
@ -307,9 +308,8 @@ pim@summer:~$ sudo tcpdump -evni dummy0
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```
I want to point out that nothing, so far, has anything to do with my gerrit change, all of this
works with upstream VPP just fine. I can see two VxLAN encapsulated packets, both destined to
`192.0.2.254:4789`. Cool.
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
### Dynamic VPP VxLAN