From 8a991bee47322d01ebbc149cf42c1223eacd5318 Mon Sep 17 00:00:00 2001 From: Pim van Pelt Date: Wed, 9 Apr 2025 22:57:24 -0500 Subject: [PATCH] A few typo and readability fixes --- content/articles/2025-04-09-frysix-evpn.md | 117 +++++++++++++-------- 1 file changed, 71 insertions(+), 46 deletions(-) diff --git a/content/articles/2025-04-09-frysix-evpn.md b/content/articles/2025-04-09-frysix-evpn.md index d8c786f..b7cbb97 100644 --- a/content/articles/2025-04-09-frysix-evpn.md +++ b/content/articles/2025-04-09-frysix-evpn.md @@ -13,13 +13,13 @@ a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHE the most densely populated facilities in western Europe. He was looking for a few launching customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company -[[ERITAP](https://www.eritap.com/)], took delivery in May of 2021, and this is when the internet -exchange with _Frysian roots_ was born. +[[ERITAP](https://www.eritap.com/)], took delivery of that rack in May of 2021, and this is when the +internet exchange with _Frysian roots_ was born. In the years from 2021 until now, Arend and I have been operating the exchange with reasonable success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool -800Gbit of traffic or so. It's dope, and very rewarding to be a part of this community! +800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community! ## Frys-IX is growing @@ -27,9 +27,10 @@ We have several members with a 2x100G LAG and even though all inter-datacenter l fiber or WDM, we're starting to feel the growing pains as we set our sights to the next step growth. You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. 
When it did 133.7Gbit of traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining -the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're coming! +the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our +way! -It became clear that we would not be able to keep a dependable peering platform if FrysIX was a +It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and balancing traffic over those ports). We need to modernize in order to stay ahead of the growth @@ -42,8 +43,8 @@ curve. The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration, high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity to your data center networks and peering network environments. These devices are built around the -Broadcom _Trident_ chipset, in the case of the lefthand "D4" platform, this is a Trident4 with -28x100G and 8x400G ports. +Broadcom _Trident_ chipset, in the case of the "D4" platform, this is a Trident4 with 28x100G and +8x400G ports. Whoot! {{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}} @@ -52,6 +53,7 @@ What I find particularly awesome of the Trident series is their speed (total ban a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of 2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right. +That's a 32x100G router. 
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these @@ -84,13 +86,21 @@ datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans. **Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway -protocol is EBGP in what they call the _underlay_, and on top of that, some secondary EBGP that's +protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP needs either a 'full mesh', or external route reflectors). -Setting aside eVPN for a second, if I were to build a transport network, much like [[IPng Site -Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a simpler design: +**Critique 4**: These reference designs also make an assumption that all fiber is local and while +links can fail, it will be relatively rare to _drain_ a link. However, in cross-datacenter networks, +draining links for maintenance is very common, for example if the dark fiber provider needs to +perform maintenance. With these eBGP-over-eBGP connections, traffic engineering is more difficult +than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic. + +Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built +[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive +and simple (I would even dare say elegant) design: + 1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. 
There is no benefit, to me at least, to use BGP as an IGP. @@ -113,7 +123,8 @@ leave your comments below, and don't forget to like-and-subscribe :-) Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two 400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia -IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). +IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). He wires all of this up +to look like the picture on the right. #### Underlay: Nokia's SR Linux @@ -157,7 +168,7 @@ A:linuxadmin@nikhef# commit stay Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I -should be able to do my first ping: +should be able to do my first jumboframe ping: ``` A:linuxadmin@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do @@ -187,8 +198,8 @@ A:linuxadmin@nikhef# commit stay Similar to in JunOS, I can descend into a configuration scope (the first line goes into the _network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_ called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration -(on the nikhef router and also the equinix router, with its own router-id), OSPF shoots to life -immediately: +(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF shoots +to life immediately: ``` A:linuxadmin@nikhef# show network-instance default protocols ospf neighbor @@ -230,17 +241,18 @@ Delicious! OSPF has learned the loopback, and it is now reachable. As with most to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. 
Then going from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on -the Nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF -for these), makes the whole network shoot to life. +the _nikhef_ router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF +for these), makes the whole network shoot to life. Slick! #### Underlay: Arista I'll point out that one of the devices in this topology is an Arista. We have several of these ready -for deployment at FrysIX. They are a lot more affordable, come with 32x100G ports, and are really -good at packet slinging because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less -faetures than the _Trident_ chipset, but they happen to have all the features we need to run our -internet exchange . So I turn my attention to the Arista in the topology. I am much more comfortable -configuring the whole thing here, as it's not my first time touching these devices: +for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand / +refurbished market. These switches come with 32x100G ports, and are really good at packet slinging +because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer features than the +_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our +internet exchange. So I turn my attention to the Arista in the topology. 
I am much more +comfortable configuring the whole thing here, as it's not my first time touching these devices: ``` arista-leaf#show run int loop0 @@ -266,8 +278,8 @@ router ospf 65500 ``` I complete the configuration for the other two core ports on this Arista, port Eth31/1 connects also -to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to -the nokia-leaf IXR-7220-D2 with a cost of 10. +to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to +the _nokia-leaf_ IXR-7220-D2 with a cost of 10. It's nice to see that OSPF in action - there are two equal path (but high cost) OSPF paths via router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3 (nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia -> @@ -292,13 +304,13 @@ high cost for now. #### Overlay EVPN: SR Linux -The big-idea here is to use iBGP with the same AS number, and because there are two main facilities -(NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for -others. It means that they will have an iBGP session amongst themselves (198.191.16.0 <-> -198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet. -This way, I don't have to configure any more than strictly necessary on the core routers. Any new -router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to -configure the Nokia's like this: +The big-picture idea here is to use iBGP with the same AS number, and because there are two main +facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as +route-reflectors for others. It means that they will have an iBGP session amongst themselves +(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the +198.19.16.0/24 subnet. 
This way, I don't have to configure any more than strictly necessary on the +core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core +routers. I proceed to configure BGP on the Nokias like this: ``` A:linuxadmin@nikhef# / network-instance default protocols bgp A:linuxadmin@nikhef# set admin-state enable @@ -346,7 +358,7 @@ Summary: A few things to note here - there one _configured_ neighbor (this is the other IXR-7220-D4 router), and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address family that they are exchanging information for is the _evpn_ family, and no prefixes have been -learned or sent het (that's the `[0/0/0]` designation in the last column). +learned or sent yet (that's the `[0/0/0]` designation in the last column). #### Overlay EVPN: Arista @@ -382,11 +394,14 @@ Neighbor AS Session State AFI/SAFI AFI/SAFI State N On this leaf node, I'll have a redundant iBGP session with the two core nodes. Since those core nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is -enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sesions with both. Voila! +enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers. +Voila! #### VXLAN EVPN: SR Linux -Nokia informs me that it uses a special interface called _system0_ to source its VXLAN traffic from. -So it's a matter of defining that interface and associate a VXLAN interface with it, like so: + +Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its +VXLAN traffic from, and add the interface to the _default_ network-instance. 
So it's a matter of +defining that interface and associating a VXLAN interface with it, like so: ``` A:linuxadmin@nikhef# set / interface system0 admin-state enable @@ -430,8 +445,8 @@ A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt A:linuxadmin@nikhef# commit stay ``` -In the first block here, I take what is a 100G port called `ethernet-1/9` and I split it into 4x25G -ports. I'll force the port speed to 10G because Arend has taken a 40G-4x10G DAC, and it happens that +In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G +ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any untagged traffic to enter it. @@ -444,13 +459,13 @@ previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred _VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read! The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates -the newly crated untagged sub-interface ethernet-1/9/3.0 with with the VXLAN interface, and starts a +the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and import/export route-targets. For simplicity I've just used the same for each: 65500:2604. I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia -routers: `ethernet-1/9/3.0` on the equinix router and `ethernet-1/9.0` on the nokia-leaf router. +routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router. 
Each of these goes to a 10Gbps port on a Debian machine. #### VXLAN EVPN: Arista @@ -476,7 +491,8 @@ interface Vxlan1 After creating VLAN 2604 on making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate -local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias. +local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias +previously. Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP sessions: @@ -573,11 +589,11 @@ AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li * ec RD: 65500:2604 imet 198.19.18.3 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 ``` -There's a lot to unpack here! The Arista is seeing that from the _route-discriminator_ I configured +There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for -the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator +the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator 198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on -iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix). +iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_). I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor @@ -643,10 +659,10 @@ entries. 
One thing to note -- the SR Linux implementation leaves the type-2 rout #### Results: Debian view -There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. Arend -hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections. -This network card is a regular in my AS8298 network, as it has excellent DPDK support and can pump -easily 40Mpps with VPP. IPng 🥰 Intel X710! +There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said, +Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ +connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support +and can pump easily 40Mpps with VPP. IPng 🥰 Intel X710! ``` root@debian:~ # ip netns add nikhef @@ -755,3 +771,12 @@ as well as SR Linux, and in particular wanted to give a big "Thank you!" for hel symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure gold! + +I also want to thank Niek for helping me take my first baby steps onto this platform and patiently +answering my nerdly questions about the platform, the switch chip, and the configuration philosophy. +Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with +Arend and me on a video call, giving a bunch of operational tips and tricks along the way. + +Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which +Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and +OOB access, and for brainstorming the config with me!