From 3b7e576d20ae60cc61b04e2afb0572dd4a25fb7b Mon Sep 17 00:00:00 2001 From: Pim van Pelt Date: Thu, 10 Apr 2025 01:16:19 -0500 Subject: [PATCH] Typo and readability --- content/articles/2025-04-09-frysix-evpn.md | 84 +++++++++++----------- 1 file changed, 44 insertions(+), 40 deletions(-) diff --git a/content/articles/2025-04-09-frysix-evpn.md b/content/articles/2025-04-09-frysix-evpn.md index b0572a9..5b6f052 100644 --- a/content/articles/2025-04-09-frysix-evpn.md +++ b/content/articles/2025-04-09-frysix-evpn.md @@ -76,7 +76,7 @@ notably they don't have to do provider edge functionality like VXLAN encap and d Almost all of these designs are showing how one might build a leaf-spine network for hyperscale. **Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking, -in the picture above I have these beautiful Nokia IXR-7220-D4 switches, using two 400G ports to +in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to connect between the facilities, and six 100G ports to connect the smaller breakout switches. That would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact. @@ -88,14 +88,15 @@ It's much more economical to create a star-topology that minimizes cross-datacen **Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume -hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP -needs either a 'full mesh', or external route reflectors). +hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP +needs either a 'full mesh', or external route reflectors. 
**Critique 4**: These reference designs also make an assumption that all fiber is local and while -links can fail, it will be relatively rare to _drain_ a link. However, in cross-datacenter networks, -draining links for maintenance is very common, for example if the dark fiber provider needs to -perform maintenance. With these eBGP-over-eBGP connections, traffic engineering is more difficult -than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic. +optics and links can fail, it will be relatively rare to _drain_ a link. However, in +cross-datacenter networks, draining links for maintenance is very common, for example if the dark +fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP +connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a +link, to reroute traffic. Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built [[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive @@ -121,16 +122,16 @@ for the overlay! I have a feeling that some folks will dispise me for being cont leave your comments below, and don't forget to like-and-subscribe :-) Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two -400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN -capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia -IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). He wires all of this up +400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN +capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset, and a smaller Nokia +IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up to look like the picture on the right. 
#### Underlay: Nokia's SR Linux -We boot up the lab, verify that all the optics and links are up, and connect the management ports to -an OOB network that I can remotely log in to. This is the first time that either of us work on -Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek. +We boot up the equipment, verify that all the optics and links are up, and connect the management +ports to an OOB network that I can remotely log in to. This is the first time that either of us work +on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek. ``` [pim@nikhef ~]$ sr_cli @@ -181,7 +182,7 @@ PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data. #### Underlay: SR Linux OSPF -OK, let's get these two Nokia routers to speak OSPF, so that they can reach each others' loopbacks. +OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback. It's really easy: ``` @@ -195,11 +196,11 @@ A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true A:pim@nikhef# commit stay ``` -Similar to in JunOS, I can descend into a configuration scope (the first line goes into the +Similar to in JunOS, I can descend into a configuration scope: the first line goes into the _network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_ called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration -(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF shoots -to life immediately: +(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF quickly +shoots into action: ``` A:pim@nikhef# show network-instance default protocols ospf neighbor @@ -241,8 +242,8 @@ Delicious! OSPF has learned the loopback, and it is now reachable. As with most to 1 (in this case: understanding how SR Linux works at all) is the most difficult part.
Then going from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on -the _nikhef_ router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF -for these), makes the whole network shoot to life. Slick! +the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and +turning on OSPF for these), makes the whole network shoot to life. Slick! #### Underlay: Arista @@ -277,13 +278,14 @@ router ospf 65500 max-lsa 12000 ``` -I complete the configuration for the other two core ports on this Arista, port Eth31/1 connects also +I complete the configuration for the other two interfaces on this Arista, port Eth31/1 connects also to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to the _nokia-leaf_ IXR-7220-D2 with a cost of 10. It's nice to see that OSPF in action - there are two equal path (but high cost) OSPF paths via router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3 -(nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia -> -equinix). +(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia -> +equinix). Dope! + ``` arista-leaf#show ip ospf nei Neighbor ID Instance VRF Pri State Dead Time Address Interface @@ -304,13 +306,14 @@ high cost for now. #### Overlay EVPN: SR Linux -The big-picture idea here is to use iBGP with the same AS number, and because there are two main -facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as +The big-picture idea here is to use iBGP with the same private AS number, and because there are two +main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for others. 
It means that they will have an iBGP session amongst themselves (198.191.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to configure BGP on the Nokia's like this: + ``` A:pim@nikhef# / network-instance default protocols bgp A:pim@nikhef# set admin-state enable @@ -358,7 +361,7 @@ Summary: A few things to note here - there one _configured_ neighbor (this is the other IXR-7220-D4 router), and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address family that they are exchanging information for is the _evpn_ family, and no prefixes have been -learned or sent yet (that's the `[0/0/0]` designation in the last column). +learned or sent yet, shown by the `[0/0/0]` designation in the last column. #### Overlay EVPN: Arista @@ -400,7 +403,7 @@ Voila! #### VXLAN EVPN: SR Linux Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its -VXLAN traffic from, and add the interface to the _default_ network-instance. So it's a matter of +VXLAN traffic from, and to add this interface to the _default_ network-instance. So it's a matter of defining that interface and associate a VXLAN interface with it, like so: ``` @@ -459,10 +462,11 @@ previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred _VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read! 
The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates -the newly crated untagged sub-interface `ethernet-1/9/3.0` with with the VXLAN interface, and starts a +the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the -VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and -import/export route-targets. For simplicity I've just used the same for each: 65500:2604. +VXLAN sub-interface, and signalling of all MAC addresses learned to use the specified +route-distinguisher and import/export route-targets. For simplicity I've just used the same for +each: 65500:2604. I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router. @@ -592,7 +596,7 @@ AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator -198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on +198.19.16.3 (the loopback of the _nokia-leaf_ router), from two cluster members, the active one on iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_). I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are @@ -650,9 +654,9 @@ Type 3 Inclusive Multicast Ethernet Tag Routes -------------------------------------------------------------------------------------------------------------------------- ``` -I have to say, SR Linux is incredibly chatty!
But, I can see all the relevant bits and bobs here. -Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch, one -pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the IMET +I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs +here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch, +one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet` entries. One thing to note -- the SR Linux implementation leaves the type-2 routes empty with a 0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL (unspecified). But, everything looks great! @@ -713,14 +717,14 @@ fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE 2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE ``` -The Debian machine puts each network card into its own network namespace, and gives it both an IPv4 +The Debian machine puts each network card into its own network namespace, and gives them both an IPv4 and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the `arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4 -neighbor table on the network card that is connected to the Equinix router. All three MAC addresses are -seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. +neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are +seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah! -Performance? We got that! +Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN.... 
``` root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12 Connecting to host 192.0.2.12, port 5201 @@ -757,9 +761,9 @@ Notably: using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for BUM flooding. -* Andy informs me that Arista also has this feature. By setting 'router l2-vpn' and 'arp learning bridged', - the suppression of ARP requests/replies also works in the same way. This greatly reduces cross - router BUM flooding. If DE-CIX can do it, so can FrysIX :) +* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`, + the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router + BUM flooding. If DE-CIX can do it, so can FrysIX :) * some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not as difficult as I thought, having some automation in place will avoid errors and mistakes. It would suck if the IXP collapsed because I botched a link drain or PNI configuration!