Typo and formatting fixes
All checks were successful
continuous-integration/drone/push Build is passing
@@ -10,12 +10,12 @@ title: 'VPP and eVPN/VxLAN - Part 1'
 You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
 the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
 configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
-it [[here]({<< ref 2025-04-09-frysix-evpn >>})] back in April.
+it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
 
 I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
 Lines, and these are straight forward because they typically only have two endpoints. A "regular"
 VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
-look at an article on [[L2 Gymnastics]({<< ref 2022-01-12-vpp-l2 >>})] for that. But the real kicker
+look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
 is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS) or also
 called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
 level of awesome.
@@ -63,15 +63,16 @@ vpp0# set int l2 bridge vxlan_tunnel2 8298
 vpp0# set int l2 bridge vxlan_tunnel3 8298
 ```
 
-I have to replicate this configuration to all other `vpp1`-`vpp4` routers. While it does work, it's
-really not very practical. When other VPP instances get added to a VPLS, every other router will
-have to have a new VxLAN tunnel created and added to its local bridge domain. Consider 1000s of VPLS
-instances on 100s of routers, it would yield ~100'000 VxLAN tunnels on every router, yikes!
+To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
+While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
+every other router will have to have a new VxLAN tunnel created and added to its local bridge
+domain. Consider 1000s of VPLS instances on 100s of routers, it would yield ~100'000 VxLAN tunnels
+on every router, yikes!
 
 Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
 full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
 headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
-connects, and their job is to redistribute routing information between the fleet of routers. THis
+connects, and their job is to redistribute routing information between the fleet of routers. This
 turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
 regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
 BGP connections in the naive approach.
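
As a quick check of the session counts quoted in the hunk above, here is a minimal sketch of the arithmetic. The function names are illustrative only, and the sketch assumes the route reflectors are separate from the 1'000 client routers and ignores any sessions between the reflectors themselves.

```python
def full_mesh_sessions(n: int) -> int:
    """Unique iBGP sessions when every router peers with every other router."""
    return n * (n - 1) // 2


def full_mesh_neighbor_statements(n: int) -> int:
    """Neighbor configurations across the fleet: each session is counted at both ends."""
    return n * (n - 1)


def route_reflector_sessions(n: int, reflectors: int) -> int:
    """Sessions when each of the n client routers peers only with the reflectors."""
    return n * reflectors


if __name__ == "__main__":
    n = 1000
    print(full_mesh_sessions(n))             # 499500
    print(full_mesh_neighbor_statements(n))  # 999000
    print(route_reflector_sessions(n, 3))    # 3000
```

The ~1'000'000 figure corresponds to N*(N-1), which counts each session at both ends. Counted once per session, the full mesh needs roughly half as many, which is still O(N^2) against O(N) for the route reflector design.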
@@ -100,8 +101,8 @@ configuration:
 {{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
 
 For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
-L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://birg.nic.cz/)] is slowly catching up, and
-has a few of these control plane parts set up (mostly MAC-VRF). Commercial vendors like Arista,
+L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://birg.nic.cz/)], is slowly catching up, and
+has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
 Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
 changes.
 
@@ -122,21 +123,21 @@ that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, whe
 for flooding, I will need to prepare and replicate them, sending them to each VTEP.
 
 
-1. A set of APIs will be needed to manipulate these:
-* ***Interface***: I will need to have an interface create, delete and list call, which will
-be able to maintain the interfaces, their metadata like source address, source/destination port,
-VNI and such.
-* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where,
-With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
-dst_addr can be written into the packet.
-* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
-remove and list which VTEPs should receive this packet.
+A set of APIs will be needed to manipulate these:
+* ***Interface***: I will need to have an interface create, delete and list call, which will
+be able to maintain the interfaces, their metadata like source address, source/destination port,
+VNI and such.
+* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where,
+With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
+dst_addr can be written into the packet.
+* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
+remove and list which VTEPs should receive this packet.
 
 It would be pretty dope if the configuration looked something like this:
 ```
-vpp0# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
-vpp0# evpn-vxlan l2fib interface <iface> mac <mac> dst <v46address> [del]
-vpp0# evpn-vxlan flood interface <iface> dst <v46address> [del]
+vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
+vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
+vpp# evpn-vxlan flood <iface> dst <v46address> [del]
 ```
 
 The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
@@ -144,36 +145,36 @@ destinations must match the address family of an interface of type evpn-vxlan. A
 might be:
 
 ```
-vpp0# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
-vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
-vpp0# evpn-vxlan l2fib interface evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
-vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::2
-vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::3
-vpp0# evpn-vxlan flood interface evpn-vxlan0 dst 2001:db8::4
+vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
+vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
+vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
+vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
+vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
+vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
 ```
 
 By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
-plugin. One way in which we might do this is to allow for the creation of a normal vxlan tunnel to
-allow for its destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal
-'dynamic' tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN
-packet by the time it takes to call `ip46_address_is_zero()` which is only a handfull of clocks.
+plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
+destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
+tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
+the time it takes to call `ip46_address_is_zero()` which is only a handfull of clocks.
 
 ### Bridge Domain
 
 {{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
 
-It's important to understand that L2 learning is **required** for eVPN to work: each router needs to
-be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This rules out
-the simple case of L2XC because there, no learning is performed. The corollary is that a
+It's important to understand that L2 learning is **required** for eVPN to function. Each router
+needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
+rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
 bridge-domain is required for any form of eVPN.
 
 The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
 which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
-essentially to a sw_if_index output. The L2FIB of the eVPN needs a bit more information though,
-notably a `ip46address` struct to know which VTEP to send to. It's tempting to add this extra data
-to the bridge domain code. I would recommend against it, because other implementations, for example
-MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even the VxLAN
-implementation I'm thinking about might want to be able to override other things like the
+essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
+though, notably a `ip46address` struct to know which VTEP to send to. It's tempting to add this
+extra data to the bridge domain code. I would recommend against it, because other implementations,
+for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
+the VxLAN implementation I'm thinking about might want to be able to override other things like the
 destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
 code will just clutter it, for all users, not just those users who might want eVPN.
 
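
One way to picture keeping the VTEP data out of the bridge-domain code is a per-interface side table. The sketch below is a hypothetical Python model, not VPP code: the class and method names are invented for illustration, and it only mirrors the proposed `l2fib` and `flood` commands from this diff, including the rule that destinations must match the address family of the interface.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address, IPv6Address, ip_address
from typing import Dict, List, Union

Address = Union[IPv4Address, IPv6Address]


@dataclass
class DynamicVxlanInterface:
    """Hypothetical model of one dynamic VxLAN interface (not a VPP structure)."""

    src: Address
    vni: int
    dst_port: int = 4789
    l2fib: Dict[str, Address] = field(default_factory=dict)  # MAC -> remote VTEP
    flood: List[Address] = field(default_factory=list)  # VTEPs that receive BUM traffic

    def _parse_dst(self, dst: str) -> Address:
        addr = ip_address(dst)
        # Destinations must use the same address family as the interface source.
        if addr.version != self.src.version:
            raise ValueError(f"{dst} does not match the address family of {self.src}")
        return addr

    def l2fib_add(self, mac: str, dst: str) -> None:
        self.l2fib[mac.lower()] = self._parse_dst(dst)

    def flood_add(self, dst: str) -> None:
        addr = self._parse_dst(dst)
        if addr not in self.flood:
            self.flood.append(addr)

    def destinations(self, dst_mac: str) -> List[Address]:
        """Return the VTEPs that should receive a frame sent to dst_mac."""
        mac = dst_mac.lower()
        group_bit = int(mac.split(":")[0], 16) & 0x01  # set for broadcast/multicast
        if not group_bit and mac in self.l2fib:
            return [self.l2fib[mac]]  # known unicast: exactly one VTEP
        return list(self.flood)  # BUM: replicate to every VTEP in the flood list


if __name__ == "__main__":
    iface = DynamicVxlanInterface(src=ip_address("2001:db8::1"), vni=8298)
    iface.l2fib_add("00:01:02:82:98:02", "2001:db8::2")
    iface.l2fib_add("00:01:02:82:98:03", "2001:db8::3")
    for vtep in ("2001:db8::2", "2001:db8::3", "2001:db8::4"):
        iface.flood_add(vtep)
    print(iface.destinations("00:01:02:82:98:02"))
    print(iface.destinations("ff:ff:ff:ff:ff:ff"))
```

Because the table lives with the interface rather than with the bridge domain, per-VTEP overrides such as a different destination port or VNI could be added to the stored value later without touching the generic L2 path.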
@@ -211,19 +212,19 @@ receiving a few types of eVPN messages. The ones I'm interested in are:
 - Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
 needed.
 
-The 'on the way in' stuff can be easily done with the proposed APIs in the Dynamic VxLAN plugin.
-Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is concerned. It's just that
-the controlplane implementation needs to somehow _feed_ the API, so an external program may be
-needed, or alterntively the Linux Control Plane netlink plugin might be used to consume this
-information.
+The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
+eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
+concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
+external program may be needed, or alterntively the Linux Control Plane netlink plugin might be used
+to consume this information.
 
 The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
 domains and associate them with the right IMET announcements, and for each MAC address learned, pick
-them up and advertise them into eVPN. Later, once ARP and ND proxying is a thing, I'll have to
-revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it with
-some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and similarly on
-the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies can be
-synthesized based on what we've learned in eVPN.
+them up and advertise them into eVPN. Later, if ever ARP and ND proxying becomes important, I'll
+have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
+with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and
+similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
+can be synthesized based on what we've learned in eVPN.
 
 # Demonstration
 
@@ -233,22 +234,22 @@ I'll build a small demo environment on Summer to show how the interaction of VxL
 Domain works today:
 
 ```
-create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
-set int state tap0 up
-set int ip address tap0 192.0.2.1/24
-set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
-set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
-set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
+vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
+vpp# set int state tap0 up
+vpp# set int ip address tap0 192.0.2.1/24
+vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
+vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
+vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
 
-create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
-set int state vxlan_tunnel0 up
+vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
+vpp# set int state vxlan_tunnel0 up
 
-create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
-set int state tap1 up
+vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
+vpp# set int state tap1 up
 
-create bridge-domain 8298
-set int l2 bridge tap1 8298
-set int l2 bridge vxlan_tunnel0 8298
+vpp# create bridge-domain 8298
+vpp# set int l2 bridge tap1 8298
+vpp# set int l2 bridge vxlan_tunnel0 8298
 ```
 
 I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some
@@ -307,9 +308,8 @@ pim@summer:~$ sudo tcpdump -evni dummy0
 192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
 ```
 
-I want to point out that nothing, so far, has anything to do with my gerrit change, all of this
-works with upstream VPP just fine. I can see two VxLAN encapsulated packets, both destined to
-`192.0.2.254:4789`. Cool.
+I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
+I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
 
 ### Dynamic VPP VxLAN
 