Add article on SR Linux + Arista EVPN

2025-04-09 22:25:51 -05:00
parent 01820776af
commit d9e2f407e7
7 changed files with 771 additions and 0 deletions


@ -0,0 +1,757 @@
---
date: "2025-04-09T07:51:23Z"
title: 'FrysIX eVPN: think different'
---
{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}
# Introduction
Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
the most densely populated facilities in western Europe. He was looking for a few launching
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
[[ERITAP](https://www.eritap.com/)] took delivery in May of 2021, and this is when the internet
exchange with _Frysian roots_ was born.
In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
800Gbit of traffic or so. It's dope, and very rewarding to be a part of this community!
## Frys-IX is growing
We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're coming!
It became clear that we would not be able to keep a dependable peering platform if FrysIX was a
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
curve.
## Hello Nokia
{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}
The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
to your data center networks and peering network environments. These devices are built around the
Broadcom _Trident_ chipset; in the case of the lefthand "D4" platform, this is a Trident4 with
28x100G and 8x400G ports.
{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
What I find particularly awesome about the Trident series is their speed (total bandwidth of
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
_merchant silicon_ switch chips in use on the
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
website.
### eVPN: A small rant
{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}
First, I need to get something off my chest. Consider a topology for an internet exchange platform,
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
every design or reference architecture I can find on the Internet, assumes folks want to build a
[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf
and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones,
notably they don't have to do provider edge functionality like VXLAN encap and decapsulation.
Almost all of these designs are showing how one might build a leaf-spine network for hyperscale.
**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
in the picture above I have these beautiful Nokia IXR-7220-D4 switches, using two 400G ports to
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact.
**Critique 2**: all 'leaf' (either IXR-7220-D2 routers or Arista switches) can't realistically
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
protocol is EBGP in what they call the _underlay_, and on top of that, some secondary EBGP that's
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP
needs either a 'full mesh', or external route reflectors).
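To put rough numbers on that trade-off, here is a minimal Python sketch that counts iBGP sessions for a full mesh versus a design with two route reflectors that also peer with each other. The function names are my own illustration and are not taken from any vendor tooling.

```python
def full_mesh_sessions(routers: int) -> int:
    """Every router peers with every other router: n * (n - 1) / 2 sessions."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int = 2) -> int:
    """Each client peers with every reflector; the reflectors are meshed among themselves."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (4, 10, 100):
    print(f"{n:4d} routers: full mesh={full_mesh_sessions(n):5d}  "
          f"two reflectors={route_reflector_sessions(n):4d}")
```

At the scale of an exchange with about ten switches, both counts stay small, which is part of why one AS number per switch buys little here.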
Setting aside eVPN for a second, if I were to build a transport network, much like [[IPng Site
Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a simpler design:
1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
BGP as an IGP.
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
supported it, and fall back to a LACP signaled bundle-ethernet otherwise.
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
to the datacenter fabric mindset), I would simply install iBGP against two or three route
reflectors, and exchange routing information within the same single AS number.
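The addressing part of this plan can be sketched with Python's standard `ipaddress` module. The two prefixes below are the ones used in the lab further down; the order of routers and links is an assumption chosen so that the allocation lines up with the lab addresses, not a description of how the addresses were actually assigned.

```python
import ipaddress

LOOPBACKS = ipaddress.ip_network("198.19.16.0/24")  # one /32 per router
TRANSIT = ipaddress.ip_network("198.19.17.0/24")    # one /31 per point-to-point link

# Assumed ordering, for illustration only.
routers = ["equinix", "nikhef", "arista-leaf", "nokia-leaf"]
links = [("equinix", "nikhef"), ("nikhef", "arista-leaf"), ("nikhef", "arista-leaf")]

# Iterating over a network yields every address, starting at the network address.
loopback_of = {
    name: ipaddress.ip_interface(f"{address}/32")
    for name, address in zip(routers, LOOPBACKS)
}

# Hand out consecutive /31s; each /31 holds exactly two usable addresses.
plan = []
for (a, b), subnet in zip(links, TRANSIT.subnets(new_prefix=31)):
    low, high = list(subnet)
    plan.append((a, f"{low}/31", b, f"{high}/31"))

for name, interface in loopback_of.items():
    print(f"{name:12} loopback {interface}")
for a, a_addr, b, b_addr in plan:
    print(f"{a} {a_addr} <-> {b} {b_addr}")
```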
### eVPN: A demo topology
{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}
So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
leave your comments below, and don't forget to like-and-subscribe :-)
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN
capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia
IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset).
#### Underlay: Nokia's SR Linux
We boot up the lab, verify that all the optics and links are up, and connect the management ports to
an OOB network that I can remotely log in to. This is the first time that either of us has worked on
Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
```
[linuxadmin@nikhef ~]$ sr_cli
--{ running }--[ ]--
A:linuxadmin@nikhef# enter candidate
--{ candidate shared default }--[ ]--
A:linuxadmin@nikhef# set / interface lo0 admin-state enable
A:linuxadmin@nikhef# set / interface lo0 subinterface 0 admin-state enable
A:linuxadmin@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
A:linuxadmin@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
A:linuxadmin@nikhef# commit stay
```
There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a
_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it a /32 address. In SR
Linux, any interface has to be associated with a _network-instance_, think of those as routing
domains or VRFs. There's a conveniently named _default_ network-instance, which I'll add this and
the point-to-point interface between the two 400G routers to:
```
A:linuxadmin@nikhef# info flat interface ethernet-1/29
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
A:linuxadmin@nikhef# set / network-instance default type default
A:linuxadmin@nikhef# set / network-instance default admin-state enable
A:linuxadmin@nikhef# set / network-instance default interface ethernet-1/29.0
A:linuxadmin@nikhef# set / network-instance default interface lo0.0
A:linuxadmin@nikhef# commit stay
```
Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
should be able to do my first ping:
```
A:linuxadmin@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
Using network instance default
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
```
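The `-s 9162` in that ping is not arbitrary: it is the largest ICMP payload that still fits in the 9190 byte `ip-mtu` once the headers are added, and `-M do` sets the don't-fragment bit so that an oversized packet fails instead of being fragmented. A small Python sketch of the arithmetic, assuming an IPv4 header without options:

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_ping_payload(ip_mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return ip_mtu - IPV4_HEADER - ICMP_HEADER


print(max_ping_payload(9190))  # 9162, matching "9162(9190) bytes of data"
```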
#### Underlay: SR Linux OSPF
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopbacks.
It's really easy:
```
A:linuxadmin@nikhef# / network-instance default protocols ospf instance default
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
A:linuxadmin@nikhef# set admin-state enable
A:linuxadmin@nikhef# set version ospf-v2
A:linuxadmin@nikhef# set router-id 198.19.16.1
A:linuxadmin@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
A:linuxadmin@nikhef# set area 0.0.0.0 interface lo0.0 passive true
A:linuxadmin@nikhef# commit stay
```
Similar to JunOS, I can descend into a configuration scope: the first line goes into the
_network-instance_ called `default`, then the _protocols_ called `ospf`, and then the _instance_
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
(on the nikhef router and also the equinix router, with its own router-id), OSPF shoots to life
immediately:
```
A:linuxadmin@nikhef# show network-instance default protocols ospf neighbor
=========================================================================================
Net-Inst default OSPFv2 Instance default Neighbors
=========================================================================================
+---------------------------------------------------------------------------------------+
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
+=======================================================================================+
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
+---------------------------------------------------------------------------------------+
-----------------------------------------------------------------------------------------
No. of Neighbors: 1
=========================================================================================
A:linuxadmin@nikhef# show network-instance default route-table all | more
IPv4 unicast route table of network instance default
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
| | | | | | Network | | | (Type) | Interface |
| | | | | | Instance | | | | |
+==================+=====+============+==============+========+==========+========+======+=============+=================+
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
| | | | | | | | | (direct) | |
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
| | | | | | | | | (direct) | |
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
+==================+=====+============+==============+========+==========+========+======+=============+=================+
A:linuxadmin@nikhef# ping network-instance default 198.19.16.0
Using network instance default
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
```
Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
going from 2 to N is easy. In my case, that means enabling several other point-to-point /31 transit
networks on the Nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU, and
turning on OSPF for these, which makes the whole network shoot to life.
#### Underlay: Arista
I'll point out that one of the devices in this topology is an Arista. We have several of these ready
for deployment at FrysIX. They are a lot more affordable, come with 32x100G ports, and are really
good at packet slinging because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer
features than the _Trident_ chipset, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more comfortable
configuring the whole thing here, as it's not my first time touching these devices:
```
arista-leaf#show run int loop0
interface Loopback0
ip address 198.19.16.2/32
ip ospf area 0.0.0.0
arista-leaf#show run int Ethernet32/1
interface Ethernet32/1
description Core: Connected to nikhef:ethernet-1/2
load-interval 1
mtu 9190
no switchport
ip address 198.19.17.5/31
ip ospf cost 1000
ip ospf network point-to-point
ip ospf area 0.0.0.0
arista-leaf#show run section router ospf
router ospf 65500
router-id 198.19.16.2
redistribute connected
network 198.19.0.0/16 area 0.0.0.0
max-lsa 12000
```
I complete the configuration for the other two core ports on this Arista: port Eth31/1 also connects
to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
the nokia-leaf IXR-7220-D2 with a cost of 10.
It's nice to see OSPF in action: there are two equal-cost (but high cost) paths via
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
(nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef ->
equinix).
```
arista-leaf#show ip ospf nei
Neighbor ID Instance VRF Pri State Dead Time Address Interface
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
arista-leaf#traceroute 198.19.16.0
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
```
So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
high cost for now.
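Why the traceroute takes the scenic route can be reproduced with a small shortest-path sketch in Python. The costs of 1000 and 10 on the Arista links come from the configuration above; the costs on the nokia-leaf to nikhef link (10) and the nikhef to equinix link (1) are assumptions for illustration, since they are not shown here.

```python
import heapq


def shortest_path(graph, source, target):
    """Plain Dijkstra; returns (total_cost, [hops]) or None if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


# Link costs per direction; the two parallel 100G links to nikhef share cost 1000.
graph = {
    "arista-leaf": {"nikhef": 1000, "nokia-leaf": 10},
    "nokia-leaf": {"arista-leaf": 10, "nikhef": 10},               # nikhef cost assumed
    "nikhef": {"arista-leaf": 1000, "nokia-leaf": 10, "equinix": 1},  # equinix cost assumed
    "equinix": {"nikhef": 1},
}

print(shortest_path(graph, "arista-leaf", "equinix"))
# (21, ['arista-leaf', 'nokia-leaf', 'nikhef', 'equinix'])
```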
#### Overlay EVPN: SR Linux
The big idea here is to use iBGP with the same AS number, and because there are two main facilities
(NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for
others. It means that they will have an iBGP session amongst themselves (198.19.16.0 <->
198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet.
This way, I don't have to configure any more than strictly necessary on the core routers. Any new
router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to
configure the Nokias like this:
```
A:linuxadmin@nikhef# / network-instance default protocols bgp
A:linuxadmin@nikhef# set admin-state enable
A:linuxadmin@nikhef# set autonomous-system 65500
A:linuxadmin@nikhef# set router-id 198.19.16.1
A:linuxadmin@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
A:linuxadmin@nikhef# set afi-safi evpn admin-state enable
A:linuxadmin@nikhef# set preference ibgp 170
A:linuxadmin@nikhef# set route-advertisement rapid-withdrawal true
A:linuxadmin@nikhef# set route-advertisement wait-for-fib-install false
A:linuxadmin@nikhef# set group overlay peer-as 65500
A:linuxadmin@nikhef# set group overlay afi-safi evpn admin-state enable
A:linuxadmin@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
A:linuxadmin@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
A:linuxadmin@nikhef# set group overlay local-as as-number 65500
A:linuxadmin@nikhef# set group overlay route-reflector client true
A:linuxadmin@nikhef# set group overlay transport local-address 198.19.16.1
A:linuxadmin@nikhef# set neighbor 198.19.16.0 admin-state enable
A:linuxadmin@nikhef# set neighbor 198.19.16.0 peer-group overlay
A:linuxadmin@nikhef# commit stay
```
I can see that iBGP sessions establish between all the devices:
```
A:linuxadmin@nikhef# show network-instance default protocols bgp neighbor
---------------------------------------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
+=============+=============+==========+=======+==========+=============+===============+============+====================+
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
---------------------------------------------------------------------------------------------------------------------------
Summary:
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
2 dynamic peers
```
A few things to note here: there is one _configured_ neighbor (this is the other IXR-7220-D4 router),
and two _dynamic_ peers, which are the Arista and the smaller IXR-7220-D2 router. The only address
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
learned or sent yet (that's the `[0/0/0]` designation in the last column).
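The dynamic peers only work because the leaf routers source their iBGP sessions from a loopback inside 198.19.16.0/24. A minimal Python sketch of the match that the `dynamic-neighbors accept match` statement implies; the helper name is hypothetical, and this only models the prefix check, not SR Linux's actual implementation.

```python
import ipaddress

# Prefix from `dynamic-neighbors accept match` on the route reflectors.
ACCEPT_PREFIX = ipaddress.ip_network("198.19.16.0/24")


def accepted_as_dynamic_peer(source_address: str) -> bool:
    """True if a BGP session sourced from this address matches the accept prefix."""
    return ipaddress.ip_address(source_address) in ACCEPT_PREFIX


for label, address in [
    ("arista-leaf loopback", "198.19.16.2"),
    ("nokia-leaf loopback", "198.19.16.3"),
    ("arista-leaf transit link", "198.19.17.5"),
]:
    print(f"{label:26} {address:14} accepted={accepted_as_dynamic_peer(address)}")
```

A session sourced from a transit /31 address would fall outside the prefix, which is one reason to pin the BGP source address to the loopback on every leaf.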
#### Overlay EVPN: Arista
The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
session as follows:
```
arista-leaf#show run section bgp
router bgp 65500
neighbor evpn peer group
neighbor evpn remote-as 65500
neighbor evpn update-source Loopback0
neighbor evpn ebgp-multihop 3
neighbor evpn send-community extended
neighbor evpn maximum-routes 12000 warning-only
neighbor 198.19.16.0 peer group evpn
neighbor 198.19.16.1 peer group evpn
!
address-family evpn
neighbor evpn activate
arista-leaf#show bgp summary
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
```
On this leaf node, I'll have a redundant iBGP session with the two core nodes. Since those core
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both. Voila!
#### VXLAN EVPN: SR Linux
Nokia informs me that it uses a special interface called _system0_ to source its VXLAN traffic from.
So it's a matter of defining that interface and associating a VXLAN interface with it, like so:
```
A:linuxadmin@nikhef# set / interface system0 admin-state enable
A:linuxadmin@nikhef# set / interface system0 subinterface 0 admin-state enable
A:linuxadmin@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
A:linuxadmin@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
A:linuxadmin@nikhef# set / network-instance default interface system0.0
A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
A:linuxadmin@nikhef# commit stay
```
This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
LAN), and it'll use the `system0.0` address to source that traffic from.
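Since this traffic rides on the underlay, the encapsulation overhead eats into the 9190 byte `ip-mtu` configured earlier. A back-of-the-envelope Python sketch, assuming VXLAN over IPv4 without IP options and untagged inner frames:

```python
OUTER_IPV4 = 20      # outer IPv4 header, no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header, which carries the 24-bit VNI
INNER_ETHERNET = 14  # inner Ethernet header, untagged


def max_inner_ip_mtu(underlay_ip_mtu: int) -> int:
    """Largest inner IP packet that fits in one unfragmented underlay packet."""
    return underlay_ip_mtu - OUTER_IPV4 - OUTER_UDP - VXLAN_HEADER - INNER_ETHERNET


print(max_inner_ip_mtu(9190))  # 9140
print(2604 < 2**24)            # True: VNI 2604 fits comfortably in 24 bits
```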
The second part is to create what SR Linux calls a MAC-VRF and put some interface in it:
```
A:linuxadmin@nikhef# set / interface ethernet-1/9 admin-state enable
A:linuxadmin@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
A:linuxadmin@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
A:linuxadmin@nikhef# set / interface ethernet-1/9/3 admin-state enable
A:linuxadmin@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
A:linuxadmin@nikhef# / network-instance peeringlan
A:linuxadmin@nikhef# set type mac-vrf
A:linuxadmin@nikhef# set admin-state enable
A:linuxadmin@nikhef# set interface ethernet-1/9/3.0
A:linuxadmin@nikhef# set vxlan-interface vxlan1.2604
A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
A:linuxadmin@nikhef# commit stay
```
In the first block here, I take what is a 100G port called `ethernet-1/9` and I split it into 4x25G
ports. I'll force the port speed to 10G because Arend has taken a 40G-4x10G DAC, and it happens that
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
untagged traffic to enter it.
{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
natural to you. I've written about the sub-interfaces logic on Cisco's IOS/XR and VPP approach in a
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred lovingly calls
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates
the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a
protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the
VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and
import/export route-targets. For simplicity I've just used the same for each: 65500:2604.
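Because every identifier is derived from the same two numbers, the scheme is easy to capture in a few lines of Python. This is only a sketch of the naming convention used here (VLAN id reused as VNI and EVI, and `ASN:VLAN` for the route-distinguisher and route-targets); the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvpnService:
    """Identifiers for one L2 service, all derived from the AS number and VLAN id."""
    asn: int
    vlan_id: int

    @property
    def vni(self) -> int:
        return self.vlan_id  # the VLAN id doubles as the VXLAN VNI

    @property
    def evi(self) -> int:
        return self.vlan_id  # and as the EVPN instance id

    @property
    def route_distinguisher(self) -> str:
        return f"{self.asn}:{self.vlan_id}"

    @property
    def route_target(self) -> str:
        return f"target:{self.asn}:{self.vlan_id}"  # same value for import and export


peeringlan = EvpnService(asn=65500, vlan_id=2604)
print(peeringlan.vni, peeringlan.evi, peeringlan.route_distinguisher, peeringlan.route_target)
# 2604 2604 65500:2604 target:65500:2604
```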
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
routers: `ethernet-1/9/3.0` on the equinix router and `ethernet-1/9.0` on the nokia-leaf router.
Each of these goes to a 10Gbps port on a Debian machine.
#### VXLAN EVPN: Arista
At this point I'm feeling pretty bullish about the whole project. Arista does not make it very
difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also):
```
arista-leaf#conf t
vlan 2604
name v-peeringlan
interface Ethernet9/3
speed forced 10000full
switchport access vlan 2604
interface Loopback1
ip address 198.19.18.2/32
interface Vxlan1
vxlan source-interface Loopback1
vxlan udp-port 4789
vxlan vlan 2604 vni 2604
```
After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias.
Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
sessions:
```
arista-leaf#conf t
router bgp 65500
vlan 2604
rd 65500:2604
route-target both 65500:2604
redistribute learned
!
```
### Results
To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord
server. In EOS, I can ask it to check for any obvious mistakes in two places:
```
arista-leaf#show vxlan config-sanity detail
Category Result Detail
---------------------------------- -------- --------------------------------------------------
Local VTEP Configuration Check OK
Loopback IP Address OK
VLAN-VNI Map OK
Flood List OK
Routing OK
VNI VRF ACL OK
Decap VRF-VNI Map OK
VRF-VNI Dynamic VLAN OK
Remote VTEP Configuration Check OK
Remote VTEP OK
Platform Dependent Check OK
VXLAN Bridging OK
VXLAN Routing OK VXLAN Routing not enabled
CVX Configuration Check OK
CVX Server OK Not in controller client mode
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
Peer VTEP IP OK MLAG peer is not connected
MLAG VTEP IP OK
Peer VLAN-VNI OK
Virtual VTEP IP OK
MLAG Inactive State OK
arista-leaf#show bgp evpn sanity detail
Category Check Status Detail
-------- -------------------- ------ ------
General Send community OK
General Multi-agent mode OK
General Neighbor established OK
L2 MAC-VRF route-target OK
import and export
L2 MAC-VRF OK
route-distinguisher
L2 MAC-VRF redistribute OK
L2 MAC-VRF overlapping OK
VLAN
L2 Suppressed MAC OK
VXLAN VLAN to VNI map for OK
MAC-VRF
VXLAN VRF to VNI map for OK
IP-VRF
```
#### Results: Arista view
Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
easy:
```
arista-leaf#show bgp evpn summary
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
BGP routing table information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
c - Contributing to ECMP, % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric LocPref Weight Path
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
* >Ec RD: 65500:2604 imet 198.19.18.3
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
* ec RD: 65500:2604 imet 198.19.18.3
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
```
There's a lot to unpack here! The Arista is seeing that, under the _route-distinguisher_ I configured
on all the devices, it is learning one MAC address behind VTEP 198.19.18.3 (this is the VTEP of
the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator
198.19.16.3 (the loopback of the nokia-leaf router), via two cluster members: the _active_ one on
iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix).
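The `Or-ID` and `C-LST` fields are the standard route reflection attributes from RFC 4456. A minimal Python sketch of what a reflector does to a route before passing it on, assuming the cluster ID defaults to the reflector's router-id (which matches the output above); the dictionary layout and function name are made up for illustration.

```python
def reflect(route: dict, reflector_cluster_id: str, learned_from_router_id: str):
    """Return the copy a route reflector would advertise, or None on a loop."""
    cluster_list = list(route.get("cluster_list", []))
    if reflector_cluster_id in cluster_list:
        return None  # our own cluster ID is already present: drop to avoid a loop
    reflected = dict(route)
    # ORIGINATOR_ID is only added if no earlier reflector has set it.
    reflected.setdefault("originator_id", learned_from_router_id)
    # The reflector prepends its cluster ID; the next hop is left untouched.
    reflected["cluster_list"] = [reflector_cluster_id] + cluster_list
    return reflected


route = {"mac": "e43a.6e5f.0c59", "next_hop": "198.19.18.3"}
via_nikhef = reflect(route, "198.19.16.1", learned_from_router_id="198.19.16.3")
via_equinix = reflect(route, "198.19.16.0", learned_from_router_id="198.19.16.3")
print(via_nikhef)
print(via_equinix)
print(reflect(via_nikhef, "198.19.16.1", learned_from_router_id="198.19.16.0"))  # None
```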
I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor
discovery or ARP requests) flooded to them. Every router participating in this L2VPN will raise such
an `imet` route, which I'll see in duplicates as well (one from each iBGP session). This checks out.
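One way to picture what a VTEP does with these routes: it collapses the duplicate `imet` entries into a list of remote VTEPs to which broadcast, unknown-unicast and multicast frames get replicated. A simplified Python sketch, assuming ingress replication and using only the VTEP addresses seen so far; the data layout is hypothetical.

```python
import ipaddress


def flood_list(imet_routes, local_vtep):
    """Unique remote VTEPs from IMET routes, ignoring which reflector sent them."""
    remotes = {r["originator_ip"] for r in imet_routes if r["originator_ip"] != local_vtep}
    return sorted(remotes, key=ipaddress.ip_address)


# What arista-leaf (VTEP 198.19.18.2) might hold: one copy per route reflector.
imet_routes = [
    {"originator_ip": "198.19.18.3", "neighbor": "198.19.16.1"},
    {"originator_ip": "198.19.18.3", "neighbor": "198.19.16.0"},
    {"originator_ip": "198.19.18.1", "neighbor": "198.19.16.1"},
    {"originator_ip": "198.19.18.1", "neighbor": "198.19.16.0"},
]

print(flood_list(imet_routes, local_vtep="198.19.18.2"))
# ['198.19.18.1', '198.19.18.3']
```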
#### Results: SR Linux view
The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
which I can inspect as follows:
```
A:linuxadmin@equinix# show network-instance default protocols bgp routes evpn route-type summary
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Show report for the BGP route table of network-instance "default"
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Status codes: u=used, *=valid, >=best, x=stale, b=backup
Origin codes: i=IGP, e=EGP, ?=incomplete
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 2 MAC-IP Advertisement Routes
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
| | distinguisher | | | | | id | | | | |
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 3 Inclusive Multicast Ethernet Tag Routes
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
| | | | | | id | |
+========+=============================+========+=====================+=================+========+=======================+
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
--------------------------------------------------------------------------------------------------------------------------
0 Ethernet Auto-Discovery routes 0 used, 0 valid
5 MAC-IP Advertisement routes 3 used, 5 valid
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
0 Ethernet Segment routes 0 used, 0 valid
0 IP Prefix routes 0 used, 0 valid
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
0 Selective Multicast Leave Sync routes 0 used, 0 valid
--------------------------------------------------------------------------------------------------------------------------
```
I have to say, SR Linux is incredibly chatty! But, I can see all the relevant bits and bobs here.
Each MAC-IP entry is accounted for: three paths are learned via the nikhef switch (198.19.16.1),
one directly from the Arista switch and one directly from the nokia-leaf router. I also see the
IMET entries. One thing to note: the SR Linux implementation shows type-2 routes that carry no IP
with a 0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL
(unspecified). But, everything looks great!
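As a sanity check on the table above, the Type 2 rows can be tallied with a few lines of Python.
This is only a sketch: the rows are copied by hand from the SR Linux output, and the status parsing
(`u` for used, `*` for valid, `>` for best) follows the legend printed by the router. The function
name `summarize` is my own.

```python
from collections import Counter

# Type 2 rows copied from the SR Linux output: (status, MAC, neighbor, next-hop)
TYPE2_ROWS = [
    ("u*>", "E4:3A:6E:5F:0C:57", "198.19.16.1", "198.19.18.1"),
    ("*", "E4:3A:6E:5F:0C:58", "198.19.16.1", "198.19.18.2"),
    ("u*>", "E4:3A:6E:5F:0C:58", "198.19.16.2", "198.19.18.2"),
    ("*", "E4:3A:6E:5F:0C:59", "198.19.16.1", "198.19.18.3"),
    ("u*>", "E4:3A:6E:5F:0C:59", "198.19.16.3", "198.19.18.3"),
]


def summarize(rows):
    """Count used and valid paths, pick the best path per MAC, and tally neighbors."""
    used = sum(1 for status, *_ in rows if status.startswith("u"))
    valid = sum(1 for status, *_ in rows if "*" in status)
    best = {mac: (nbr, nh) for status, mac, nbr, nh in rows if ">" in status}
    per_neighbor = Counter(nbr for _, _, nbr, _ in rows)
    return used, valid, best, per_neighbor


used, valid, best, per_neighbor = summarize(TYPE2_ROWS)
print(f"{len(TYPE2_ROWS)} MAC-IP routes, {used} used, {valid} valid")
for mac, (neighbor, next_hop) in sorted(best.items()):
    print(f"{mac}: best path via neighbor {neighbor}, VTEP {next_hop}")
```

The counts line up with the summary at the bottom of the router output (5 routes, 3 used, 5
valid), and each of the three MAC addresses has exactly one best path.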
#### Results: Debian view
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. Arend
hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections.
This network card is a regular in my AS8298 network, as it has excellent DPDK support and can
easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
```
root@debian:~ # ip netns add nikhef
root@debian:~ # ip link set enp1s0f0 netns nikhef
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
root@debian:~ # ip netns add arista-leaf
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
root@debian:~ # ip netns add nokia-leaf
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
root@debian:~ # ip netns add equinix
root@debian:~ # ip link set enp1s0f3 netns equinix
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
192.0.2.10 is alive
192.0.2.11 is alive
192.0.2.12 is alive
192.0.2.13 is alive
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
2001:db8::10 is alive
2001:db8::11 is alive
2001:db8::12 is alive
2001:db8::13 is alive
root@debian:~# ip netns exec equinix ip nei
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
```
The Debian machine puts each port of the network card into its own network namespace, and gives it
both an IPv4 and an IPv6 address. I can then enter the `nikhef` network namespace, which has its
port connected to the IXR-7220-D4 router called _nikhef_, and ping all four endpoints over IPv4.
Similarly, I can enter the `arista-leaf` namespace and ping all four endpoints over IPv6. Finally,
I take a look at the IPv4 and IPv6 neighbor table of the port that is connected to the _equinix_
router. The MAC addresses of the three other ports are all seen. This proves end-to-end
connectivity across the EVPN VXLAN, and full interoperability.
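The `fe80::` entries in that neighbor table are not configured anywhere; they follow from the MAC
addresses. A small Python sketch of the modified EUI-64 derivation shows the relationship, assuming
the ports use the kernel's EUI-64 based link-local addressing (which the neighbor table output
suggests). The function name `eui64_link_local` is my own.

```python
import ipaddress


def eui64_link_local(mac: str) -> str:
    """Derive the fe80::/64 link-local address from a MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit of the first octet
    # Insert ff:fe between the OUI and the device-specific half of the MAC
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80:0000:0000:0000
    return str(ipaddress.IPv6Address(prefix + interface_id))


for mac in ("e4:3a:6e:5f:0c:57", "e4:3a:6e:5f:0c:58", "e4:3a:6e:5f:0c:59"):
    print(mac, "->", eui64_link_local(mac))
```

For `e4:3a:6e:5f:0c:57` this yields `fe80::e63a:6eff:fe5f:c57`, the same address the `equinix`
namespace learned over the EVPN.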
Performance? We got that!
```
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
Connecting to host 192.0.2.12, port 5201
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
iperf Done.
```
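Is 9.90 Gbits/sec as good as it gets? A quick back-of-the-envelope calculation in Python suggests
it is. The sketch assumes untagged Ethernet frames on the 10G port, the MTU of 9000 configured
above, and TCP timestamps enabled (12 bytes of TCP options, the usual Linux default). The VXLAN
encapsulation is added by the switches, so it does not count against the host's 10G port. The
function name `tcp_goodput_gbps` is my own.

```python
def tcp_goodput_gbps(line_rate_gbps=10.0, mtu=9000, tcp_options=12):
    """Estimate the maximum TCP payload rate on an Ethernet link for a given MTU."""
    ip_header, tcp_header = 20, 20
    payload = mtu - ip_header - tcp_header - tcp_options
    # On the wire: preamble+SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12)
    wire_bytes = mtu + 8 + 14 + 4 + 12
    return line_rate_gbps * payload / wire_bytes


print(f"MTU 9000: {tcp_goodput_gbps(mtu=9000):.2f} Gbit/s")
print(f"MTU 1500: {tcp_goodput_gbps(mtu=1500):.2f} Gbit/s")
```

With jumbo frames the ceiling works out to roughly 9.90 Gbit/s, so the iperf3 run is effectively
at line rate. With a standard 1500 byte MTU the same calculation gives about 9.41 Gbit/s.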
## What's Next
There are a few improvements I can make before deploying this architecture to the internet exchange.
Notably:
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
security will greatly improve our resilience.
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
BUM flooding. If DE-CIX can do it, so can FrysIX :)
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
  as difficult as I thought, having some automation in place will help avoid mistakes. It
  would suck if the IXP collapsed because I botched a link drain or PNI configuration!
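The ARP suppression idea from the list above can be modelled in a few lines of Python. This is a
toy, not how SR Linux implements it: the class `ProxyArpTable` and its method are hypothetical, and
in a real EVPN the gleaned bindings would also be shared with other leaves as MAC-IP routes. It
only shows why gleaning bindings from ARP requests cuts down on flooding.

```python
class ProxyArpTable:
    """Toy model of ARP suppression on a leaf: learn from ARP, answer locally on a hit."""

    def __init__(self):
        self.entries = {}  # IP -> MAC, gleaned from the sender fields of ARP requests
        self.flooded = 0   # requests that still had to be flooded as BUM traffic

    def handle_request(self, sender_ip, sender_mac, target_ip):
        """Return the MAC to reply with, or None if the request must be flooded."""
        self.entries[sender_ip] = sender_mac  # glean the sender's binding
        mac = self.entries.get(target_ip)
        if mac is None:
            self.flooded += 1  # unknown target: flood, as without suppression
        return mac


table = ProxyArpTable()
# First request: the target is unknown, so it is flooded
print(table.handle_request("192.0.2.10", "e4:3a:6e:5f:0c:57", "192.0.2.11"))
# Second request: the target was gleaned earlier, so it is answered locally
print(table.handle_request("192.0.2.11", "e4:3a:6e:5f:0c:58", "192.0.2.10"))
print("flooded requests:", table.flooded)
```

Only the first request for an unknown address is flooded; once a binding is known, later requests
for it never have to leave the leaf.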
### Acknowledgements
I am relatively new to EVPN configurations, and want to give a shoutout to Andy Whitaker, who
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his
time and spent a few hours on a video call with me, explaining EVPN in great detail for both Arista
and SR Linux. In particular, I want to give him a big "Thank you!" for helping me understand
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
gold!
