diff --git a/content/articles/2025-04-09-frysix-evpn.md b/content/articles/2025-04-09-frysix-evpn.md new file mode 100644 index 0000000..d8c786f --- /dev/null +++ b/content/articles/2025-04-09-frysix-evpn.md @@ -0,0 +1,757 @@ +--- +date: "2025-04-09T07:51:23Z" +title: 'FrysIX eVPN: think different' +--- + +{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}} + +# Introduction + +Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega +is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021, +a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of +the most densely populated facilities in western Europe. He was looking for a few launching +customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on +my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company +[[ERITAP](https://www.eritap.com/)], took delivery in May of 2021, and this is when the internet +exchange with _Frysian roots_ was born. + +In the years from 2021 until now, Arend and I have been operating the exchange with reasonable +success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs +with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool +800Gbit of traffic or so. It's dope, and very rewarding to be a part of this community! + +## Frys-IX is growing + +We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark +fiber or WDM, we're starting to feel the growing pains as we set our sights to the next step growth. +You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of +traffic, Arend organized an even bigger barbecue. 
Obviously, the next step is 1337Gbit and joining +the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're coming! + +It became clear that we would not be able to keep a dependable peering platform if FrysIX was a +single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be +operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and +balancing traffic over those ports). We need to modernize in order to stay ahead of the growth +curve. + +## Hello Nokia + +{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}} + +The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration, +high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity +to your data center networks and peering network environments. These devices are built around the +Broadcom _Trident_ chipset, in the case of the lefthand "D4" platform, this is a Trident4 with +28x100G and 8x400G ports. + +{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}} + +What I find particularly awesome of the Trident series is their speed (total bandwidth of +12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and +a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern +approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of +2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right. + +ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two +IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these +beautiful Nokia devices. 
If you haven't yet, you should definitely read about these versatile +routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the +_merchant silicon_ switch chips in use on the +[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)] +website. + +### eVPN: A small rant + +{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}} + +First, I need to get something off my chest. Consider a topology for an internet exchange platform, +taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost +every design or reference architecture I can find on the Internet assumes folks want to build a +[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf +and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones, +notably they don't have to do provider edge functionality like VXLAN encap and decapsulation. +Almost all of these designs are showing how one might build a leaf-spine network for hyperscale. + +**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking, +in the picture above I have these beautiful Nokia IXR-7220-D4 switches, using two 400G ports to +connect between the facilities, and six 100G ports to connect the smaller breakout switches. That +would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact. + +**Critique 2**: the 'leaf' devices (either IXR-7220-D2 routers or Arista switches) can't all realistically +connect to both 'spines'. Our devices are spread out over two (and in practice, more like six) +datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh. +It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans. 
+ +**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway +protocol is EBGP in what they call the _underlay_, and on top of that, some secondary EBGP that's +called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume +hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP +needs either a 'full mesh', or external route reflectors). + +Setting aside eVPN for a second, if I were to build a transport network, much like [[IPng Site +Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a simpler design: +1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or +perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, in using +BGP as an IGP. +1. I would give each of the links between the switches an IPv4 /31 and enable IPv6 link-local, and give +each switch a loopback address with a /32 IPv4 and a /128 IPv6. +1. If I had multiple links between two given switches, I would probably just use ECMP if my devices +supported it, and fall back to a LACP signaled bundle-ethernet otherwise. +1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed +to the datacenter fabric mindset), I would simply set up iBGP against two or three route +reflectors, and exchange routing information within the same single AS number. + +### eVPN: A demo topology + +{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}} + +So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP +for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can +leave your comments below, and don't forget to like-and-subscribe :-) + +Arend builds this topology for me in Jubbega - also known as FrysIX HQ. 
He takes the two +400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN +capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia +IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). + +#### Underlay: Nokia's SR Linux + +We boot up the lab, verify that all the optics and links are up, and connect the management ports to +an OOB network that I can remotely log in to. This is the first time that either of us work on +Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek. + +``` +[linuxadmin@nikhef ~]$ sr_cli +--{ running }--[ ]-- +A:linuxadmin@nikhef# enter candidate +--{ candidate shared default }--[ ]-- +A:linuxadmin@nikhef# set / interface lo0 admin-state enable +A:linuxadmin@nikhef# set / interface lo0 subinterface 0 admin-state enable +A:linuxadmin@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable +A:linuxadmin@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32 +A:linuxadmin@nikhef# commit stay +``` + +There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a +_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it an /32 address. In SR +Linux, any interface has to be associated with a _network-instance_, think of those as routing +domains or VRFs. 
There's a conveniently named _default_ network-instance, which I'll add this and +the point-to-point interface between the two 400G routers to: + +``` +A:linuxadmin@nikhef# info flat interface ethernet-1/29 +set / interface ethernet-1/29 admin-state enable +set / interface ethernet-1/29 subinterface 0 admin-state enable +set / interface ethernet-1/29 subinterface 0 ip-mtu 9190 +set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable +set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31 +set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable + +A:linuxadmin@nikhef# set / network-instance default type default +A:linuxadmin@nikhef# set / network-instance default admin-state enable +A:linuxadmin@nikhef# set / network-instance default interface ethernet-1/29.0 +A:linuxadmin@nikhef# set / network-instance default interface lo0.0 +A:linuxadmin@nikhef# commit stay +``` + +Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the +loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I +should be able to do my first ping: + +``` +A:linuxadmin@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do +Using network instance default +PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data. +9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms +9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms +9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms +``` + +#### Underlay: SR Linux OSPF + +OK, let's get these two Nokia routers to speak OSPF, so that they can reach each others' loopbacks. 
+It's really easy: + +``` +A:linuxadmin@nikhef# / network-instance default protocols ospf instance default +--{ candidate shared default }--[ network-instance default protocols ospf instance default ]-- +A:linuxadmin@nikhef# set admin-state enable +A:linuxadmin@nikhef# set version ospf-v2 +A:linuxadmin@nikhef# set router-id 198.19.16.1 +A:linuxadmin@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point +A:linuxadmin@nikhef# set area 0.0.0.0 interface lo0.0 passive true +A:linuxadmin@nikhef# commit stay +``` + +Similar to in JunOS, I can descend into a configuration scope (the first line goes into the +_network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_ +called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration +(on the nikhef router and also the equinix router, with its own router-id), OSPF shoots to life +immediately: + +``` +A:linuxadmin@nikhef# show network-instance default protocols ospf neighbor +========================================================================================= +Net-Inst default OSPFv2 Instance default Neighbors +========================================================================================= ++---------------------------------------------------------------------------------------+ +| Interface-Name Rtr Id State Pri RetxQ Time Before Dead | ++=======================================================================================+ +| ethernet-1/29.0 198.19.16.0 full 1 0 36 | ++---------------------------------------------------------------------------------------+ +----------------------------------------------------------------------------------------- +No. 
of Neighbors: 1 +========================================================================================= + +A:linuxadmin@nikhef# show network-instance default route-table all | more +IPv4 unicast route table of network instance default ++------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+ +| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop | +| | | | | | Network | | | (Type) | Interface | +| | | | | | Instance | | | | | ++==================+=====+============+==============+========+==========+========+======+=============+=================+ +| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 | +| | | | | | | | | (direct) | | +| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None | +| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 | +| | | | | | | | | (direct) | | +| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None | ++==================+=====+============+==============+========+==========+========+======+=============+=================+ + +A:linuxadmin@nikhef# ping network-instance default 198.19.16.0 +Using network instance default +PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data. +64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms +64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms +``` + +Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0 +to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. 
Then going +from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, +going from 2 to N is easy: in my case, enabling several other point-to-point /31 transit networks on +the Nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU, and turning on OSPF +for these makes the whole network shoot to life. + +#### Underlay: Arista + +I'll point out that one of the devices in this topology is an Arista. We have several of these ready +for deployment at FrysIX. They are a lot more affordable, come with 32x100G ports, and are really +good at packet slinging because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer +features than the _Trident_ chipset, but they happen to have all the features we need to run our +internet exchange. So I turn my attention to the Arista in the topology. I am much more comfortable +configuring the whole thing here, as it's not my first time touching these devices: + +``` +arista-leaf#show run int loop0 +interface Loopback0 + ip address 198.19.16.2/32 + ip ospf area 0.0.0.0 +arista-leaf#show run int Ethernet32/1 +interface Ethernet32/1 + description Core: Connected to nikhef:ethernet-1/2 + load-interval 1 + mtu 9190 + no switchport + ip address 198.19.17.5/31 + ip ospf cost 1000 + ip ospf network point-to-point + ip ospf area 0.0.0.0 +arista-leaf#show run section router ospf +router ospf 65500 + router-id 198.19.16.2 + redistribute connected + network 198.19.0.0/16 area 0.0.0.0 + max-lsa 12000 +``` + +I complete the configuration for the other two core ports on this Arista: port Eth31/1 also connects +to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to +the nokia-leaf IXR-7220-D2 with a cost of 10. +It's nice to see OSPF in action - there are two equal-cost (but high-cost) OSPF paths via +router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3 +(nokia-leaf). 
The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef -> +equinix). +``` +arista-leaf#show ip ospf nei +Neighbor ID Instance VRF Pri State Dead Time Address Interface +198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1 +198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1 +198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1 + +arista-leaf#traceroute 198.19.16.0 +traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets + 1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms + 2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms + 3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms +``` + +So far, so good! The _underlay_ is up, every router can reach every other router on its loopback, +and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at +high cost for now. + +#### Overlay EVPN: SR Linux + +The big idea here is to use iBGP with the same AS number, and because there are two main facilities +(NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for +the others. It means that they will have an iBGP session amongst themselves (198.19.16.0 <-> +198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet. +This way, I don't have to configure any more than strictly necessary on the core routers. Any new +router can just plug in, form an OSPF adjacency, and connect to both core routers. 
I proceed to +configure the Nokia's like this: +``` +A:linuxadmin@nikhef# / network-instance default protocols bgp +A:linuxadmin@nikhef# set admin-state enable +A:linuxadmin@nikhef# set autonomous-system 65500 +A:linuxadmin@nikhef# set router-id 198.19.16.1 +A:linuxadmin@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay +A:linuxadmin@nikhef# set afi-safi evpn admin-state enable +A:linuxadmin@nikhef# set preference ibgp 170 +A:linuxadmin@nikhef# set route-advertisement rapid-withdrawal true +A:linuxadmin@nikhef# set route-advertisement wait-for-fib-install false +A:linuxadmin@nikhef# set group overlay peer-as 65500 +A:linuxadmin@nikhef# set group overlay afi-safi evpn admin-state enable +A:linuxadmin@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable +A:linuxadmin@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable +A:linuxadmin@nikhef# set group overlay local-as as-number 65500 +A:linuxadmin@nikhef# set group overlay route-reflector client true +A:linuxadmin@nikhef# set group overlay transport local-address 198.19.16.1 +A:linuxadmin@nikhef# set neighbor 198.19.16.0 admin-state enable +A:linuxadmin@nikhef# set neighbor 198.19.16.0 peer-group overlay +A:linuxadmin@nikhef# commit stay +``` + +I can see that iBGP sessions establish between all the devices: + +``` +A:linuxadmin@nikhef# show network-instance default protocols bgp neighbor +--------------------------------------------------------------------------------------------------------------------------- +BGP neighbor summary for network-instance "default" +Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow +--------------------------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------------------------------------------------------------- 
++-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+ +| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] | ++=============+=============+==========+=======+==========+=============+===============+============+====================+ +| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] | +| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] | +| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] | ++-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+ +--------------------------------------------------------------------------------------------------------------------------- +Summary: +1 configured neighbors, 1 configured sessions are established, 0 disabled peers +2 dynamic peers +``` + +A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router), +and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address +family that they are exchanging information for is the _evpn_ family, and no prefixes have been +learned or sent yet (that's the `[0/0/0]` designation in the last column). + +#### Overlay EVPN: Arista + +The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP +sessions as follows: + +``` +arista-leaf#show run section bgp +router bgp 65500 + neighbor evpn peer group + neighbor evpn remote-as 65500 + neighbor evpn update-source Loopback0 + neighbor evpn ebgp-multihop 3 + neighbor evpn send-community extended + neighbor evpn maximum-routes 12000 warning-only + neighbor 198.19.16.0 peer group evpn + neighbor 198.19.16.1 peer group evpn + ! 
+ address-family evpn + neighbor evpn activate + +arista-leaf#show bgp summary +BGP summary information for VRF default +Router identifier 198.19.16.2, local AS number 65500 +Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc +----------- ----------- ------------- ----------------------- -------------- ---------- ---------- +198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0 +198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0 +198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0 +198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0 +``` + +On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core +nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No +matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is +enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both. Voila! + +#### VXLAN EVPN: SR Linux +Nokia informs me that it uses a special interface called _system0_ to source its VXLAN traffic from. 
+So it's a matter of defining that interface and associating a VXLAN interface with it, like so: + +``` +A:linuxadmin@nikhef# set / interface system0 admin-state enable +A:linuxadmin@nikhef# set / interface system0 subinterface 0 admin-state enable +A:linuxadmin@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable +A:linuxadmin@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32 +A:linuxadmin@nikhef# set / network-instance default interface system0.0 +A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged +A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604 +A:linuxadmin@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address +A:linuxadmin@nikhef# commit stay +``` + +This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send +traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering +LAN), and it'll use the `system0.0` address to source that traffic from. 
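
To make the VNI a bit more tangible, here's a minimal Python sketch of the 8-byte VXLAN header as I understand it from RFC 7348, carrying VNI 2604. The function names are my own invention for illustration and are not part of SR Linux or EOS. The overhead arithmetic rests on three assumptions: an outer IPv4 header without options, an untagged inner frame, and `ip-mtu 9190` bounding the size of the outer IP packet.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned VXLAN port
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI (hypothetical helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, followed by 24 reserved bits.
    # Word 2: VNI in the top 24 bits, followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Read the VNI back out of a VXLAN header (hypothetical helper)."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


def max_inner_frame(underlay_ip_mtu: int) -> int:
    """Largest inner Ethernet frame that fits, assuming outer IPv4 + UDP + VXLAN."""
    outer_ipv4, outer_udp, vxlan = 20, 8, 8
    return underlay_ip_mtu - (outer_ipv4 + outer_udp + vxlan)


header = pack_vxlan_header(2604)
print(header.hex())            # flags byte, reserved bits, then the VNI
print(unpack_vni(header))      # the VNI read back from the header
print(max_inner_frame(9190))   # room left for the inner Ethernet frame
```

Under those assumptions, the jumbo underlay MTU leaves plenty of room for member traffic with a regular 1500 byte MTU, even after the encapsulation overhead is subtracted.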
+ +The second part is to create what SR Linux calls a MAC-VRF and put some interface in it: + +``` +A:linuxadmin@nikhef# set / interface ethernet-1/9 admin-state enable +A:linuxadmin@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4 +A:linuxadmin@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G +A:linuxadmin@nikhef# set / interface ethernet-1/9/3 admin-state enable +A:linuxadmin@nikhef# set / interface ethernet-1/9/3 vlan-tagging true +A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged +A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable +A:linuxadmin@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged + +A:linuxadmin@nikhef# / network-instance peeringlan +A:linuxadmin@nikhef# set type mac-vrf +A:linuxadmin@nikhef# set admin-state enable +A:linuxadmin@nikhef# set interface ethernet-1/9/3.0 +A:linuxadmin@nikhef# set vxlan-interface vxlan1.2604 +A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable +A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604 +A:linuxadmin@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604 +A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604 +A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604 +A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604 +A:linuxadmin@nikhef# commit stay +``` + +In the first block here, I take what is a 100G port called `ethernet-1/9` and I split it into 4x25G +ports. I'll force the port speed to 10G because Arend has taken a 40G-4x10G DAC, and it happens that +the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a +sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any +untagged traffic to enter it. 
+ +{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}} + +If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very +natural to you. I've written about the sub-interface logic in Cisco's IOS/XR and VPP in a +previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred lovingly calls +_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read! + +The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates +the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a +protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the +VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and +import/export route-targets. For simplicity I've just used the same for each: 65500:2604. + +I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia +routers: `ethernet-1/9/3.0` on the equinix router and `ethernet-1/9.0` on the nokia-leaf router. +Each of these goes to a 10Gbps port on a Debian machine. + +#### VXLAN EVPN: Arista + +At this point I'm feeling pretty bullish about the whole project. Arista does not make it very +difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also): + +``` +arista-leaf#conf t +vlan 2604 + name v-peeringlan +interface Ethernet9/3 + speed forced 10000full + switchport access vlan 2604 + +interface Loopback1 + ip address 198.19.18.2/32 +interface Vxlan1 + vxlan source-interface Loopback1 + vxlan udp-port 4789 + vxlan vlan 2604 vni 2604 +``` + +After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint +called `Loopback1`, and a VXLAN interface that uses that to source its traffic. 
Here, I'll associate +local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias. + +Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP +sessions: +``` +arista-leaf#conf t +router bgp 65500 + vlan 2604 + rd 65500:2604 + route-target both 65500:2604 + redistribute learned + ! +``` + +### Results + +To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord +server. In EOS, I can ask it to check for any obvious mistakes in two places: + +``` +arista-leaf#show vxlan config-sanity detail +Category Result Detail +---------------------------------- -------- -------------------------------------------------- +Local VTEP Configuration Check OK + Loopback IP Address OK + VLAN-VNI Map OK + Flood List OK + Routing OK + VNI VRF ACL OK + Decap VRF-VNI Map OK + VRF-VNI Dynamic VLAN OK +Remote VTEP Configuration Check OK + Remote VTEP OK +Platform Dependent Check OK + VXLAN Bridging OK + VXLAN Routing OK VXLAN Routing not enabled +CVX Configuration Check OK + CVX Server OK Not in controller client mode +MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config + Peer VTEP IP OK MLAG peer is not connected + MLAG VTEP IP OK + Peer VLAN-VNI OK + Virtual VTEP IP OK + MLAG Inactive State OK + +arista-leaf#show bgp evpn sanity detail +Category Check Status Detail +-------- -------------------- ------ ------ +General Send community OK +General Multi-agent mode OK +General Neighbor established OK +L2 MAC-VRF route-target OK + import and export +L2 MAC-VRF OK + route-distinguisher +L2 MAC-VRF redistribute OK +L2 MAC-VRF overlapping OK + VLAN +L2 Suppressed MAC OK +VXLAN VLAN to VNI map for OK + MAC-VRF +VXLAN VRF to VNI map for OK + IP-VRF +``` + +#### Results: Arista view + +Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is +easy: + +``` +arista-leaf#show bgp evpn summary +BGP summary information for 
VRF default +Router identifier 198.19.16.2, local AS number 65500 +Neighbor Status Codes: m - Under maintenance + Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc + 198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7 + 198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7 + +arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3 +BGP routing table information for VRF default +Router identifier 198.19.16.2, local AS number 65500 +Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP + c - Contributing to ECMP, % - Pending BGP convergence +Origin codes: i - IGP, e - EGP, ? - incomplete +AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop + + Network Next Hop Metric LocPref Weight Path + * >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59 + 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 + * ec RD: 65500:2604 mac-ip e43a.6e5f.0c59 + 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 + * >Ec RD: 65500:2604 imet 198.19.18.3 + 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 + * ec RD: 65500:2604 imet 198.19.18.3 + 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 +``` +There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured +on all the sessions, it is learning one MAC address with next hop 198.19.18.3 (this is the VTEP for +the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator +198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on +iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix). + +I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are +a signal from a VTEP participant that they are interested in seeing broadcast and multicast traffic (like neighbor +discovery or ARP requests) flooded to them. 
Every router participating in this L2VPN will raise such +an `imet` route, which I'll see in duplicates as well (one from each iBGP session). This checks out. + +#### Results: SR Linux view + +The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries, +which I can inspect as follows: + +``` +A:linuxadmin@equinix# show network-instance default protocols bgp routes evpn route-type summary +-------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Show report for the BGP route table of network-instance "default" +-------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Status codes: u=used, *=valid, >=best, x=stale, b=backup +Origin codes: i=IGP, e=EGP, ?=incomplete +-------------------------------------------------------------------------------------------------------------------------------------------------------------------- +BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500 +-------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Type 2 MAC-IP Advertisement Routes ++--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+ +| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility | +| | distinguisher | | | | | id | | | | | ++========+===============+========+===================+============+=============+======+============-+========+================================+==================+ +| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | +| * | 65500:2604 | 0 | 
E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | +| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | +| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | +| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | ++--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+ +-------------------------------------------------------------------------------------------------------------------------------------------------------------------- +Type 3 Inclusive Multicast Ethernet Tag Routes ++--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+ +| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop | +| | | | | | id | | ++========+=============================+========+=====================+=================+========+=======================+ +| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 | +| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 | +| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 | +| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 | +| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 | ++--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+ +-------------------------------------------------------------------------------------------------------------------------- +0 Ethernet Auto-Discovery routes 0 used, 0 valid +5 MAC-IP Advertisement routes 3 used, 5 valid +5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid +0 
Ethernet Segment routes 0 used, 0 valid +0 IP Prefix routes 0 used, 0 valid +0 Selective Multicast Ethernet Tag routes 0 used, 0 valid +0 Selective Multicast Membership Report Sync routes 0 used, 0 valid +0 Selective Multicast Leave Sync routes 0 used, 0 valid +-------------------------------------------------------------------------------------------------------------------------- +``` + +I have to say, SR Linux is incredibly chatty! But, I can see all the relevant bits and bobs here. +Each MAC-IP entry is accounted for: I can see several entries learned via the nikhef switch, one +learned from the nokia-leaf router and one learned from the Arista switch. I also see the IMET +entries. One thing to note -- the SR Linux implementation shows the unset IP field of its type-2 +routes as a 0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves it as NULL +(unspecified). But, everything looks great! + +#### Results: Debian view + +There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. Arend +hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections. +This network card is a regular in my AS8298 network, as it has excellent DPDK support and can +easily pump 40Mpps with VPP. IPng 🥰 Intel X710! 
+ +``` +root@debian:~ # ip netns add nikhef +root@debian:~ # ip link set enp1s0f0 netns nikhef +root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000 +root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0 +root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0 + +root@debian:~ # ip netns add arista-leaf +root@debian:~ # ip link set enp1s0f1 netns arista-leaf +root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000 +root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1 +root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1 + +root@debian:~ # ip netns add nokia-leaf +root@debian:~ # ip link set enp1s0f2 netns nokia-leaf +root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000 +root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2 +root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2 + +root@debian:~ # ip netns add equinix +root@debian:~ # ip link set enp1s0f3 netns equinix +root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000 +root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3 +root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3 + +root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29 +192.0.2.10 is alive +192.0.2.11 is alive +192.0.2.12 is alive +192.0.2.13 is alive + +root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13 +2001:db8::10 is alive +2001:db8::11 is alive +2001:db8::12 is alive +2001:db8::13 is alive + +root@debian:~# ip netns exec equinix ip nei +192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE +192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE +192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE +fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE +fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr 
e4:3a:6e:5f:0c:58 STALE +fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE +2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE +2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE +2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE +``` + +The Debian machine puts each port of the network card into its own network namespace, and gives it +both an IPv4 and an IPv6 address. I can then enter the `nikhef` network namespace, which has its port +connected to the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can +enter the `arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 +and IPv4 neighbor table on the port that is connected to the _equinix_ router. All three remote MAC +addresses are seen. This proves end-to-end connectivity across the EVPN VXLAN, and full +interoperability. + +Performance? We got that! +``` +root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12 +Connecting to host 192.0.2.12, port 5201 +[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201 +[ ID] Interval Transfer Bitrate Retr Cwnd +[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes +[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes +[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes +[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes +[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes +- - - - - - - - - - - - - - - - - - - - - - - - - +[ ID] Interval Transfer Bitrate Retr +[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender +[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver + +iperf Done. 
+``` + +## What's Next + +There are a few improvements I can make before deploying this architecture to the internet exchange. +Notably: +* the functional equivalent of _port security_, that is to say only allowing one or two MAC + addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port + security will greatly improve our resilience. +* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for + IRB based setups, but adding this to transparent bridge-domains is possible in Nokia +[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)], + using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise +true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for + BUM flooding. If DE-CIX can do it, so can FrysIX :) +* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not + as difficult as I thought, having some automation in place will help avoid mistakes. It + would suck if the IXP collapsed because I botched a link drain or PNI configuration! + + +### Acknowledgements + +I am relatively new to EVPN configurations, and want to give a shoutout to Andy Whitaker who +jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his +time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista +as well as SR Linux. In particular, I want to give him a big "Thank you!" for helping me understand +symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at +Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure +gold! 
diff --git a/static/assets/frys-ix/FrysIX_ Topology (concept).svg b/static/assets/frys-ix/FrysIX_ Topology (concept).svg new file mode 100644 index 0000000..2765d6a --- /dev/null +++ b/static/assets/frys-ix/FrysIX_ Topology (concept).svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/assets/frys-ix/IXR-7220-D3.jpg b/static/assets/frys-ix/IXR-7220-D3.jpg new file mode 100644 index 0000000..12ade62 --- /dev/null +++ b/static/assets/frys-ix/IXR-7220-D3.jpg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4c5d62f6f9897d8d1a6c55f965ab1f4b1096492381eb6cccdc897ebaf2c398ca +size 2584711 diff --git a/static/assets/frys-ix/Nokia Arista VXLAN.svg b/static/assets/frys-ix/Nokia Arista VXLAN.svg new file mode 100644 index 0000000..5ba9af3 --- /dev/null +++ b/static/assets/frys-ix/Nokia Arista VXLAN.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/assets/frys-ix/frysix-logo-small.png b/static/assets/frys-ix/frysix-logo-small.png new file mode 100644 index 0000000..95b2083 --- /dev/null +++ b/static/assets/frys-ix/frysix-logo-small.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:688ea24d39275eb106edb9059cf12af2ad237cc09e437a1b4172f194df5bbe56 +size 106447 diff --git a/static/assets/frys-ix/nokia-7220-d2.png b/static/assets/frys-ix/nokia-7220-d2.png new file mode 100644 index 0000000..c5a057f --- /dev/null +++ b/static/assets/frys-ix/nokia-7220-d2.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b6be06505850b71bf144ff421b6007ccb8f329048e3d9d6b8fd16fabfec0eecd +size 592609 diff --git a/static/assets/frys-ix/nokia-7220-d4.png b/static/assets/frys-ix/nokia-7220-d4.png new file mode 100644 index 0000000..b28b90a --- /dev/null +++ b/static/assets/frys-ix/nokia-7220-d4.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c5a8e1bc735332b06f6e43d725e28d5d44539813f5e22cf9147ac0bd80824dbc +size 584648