A few typo and readability fixes

2025-04-09 22:57:24 -05:00
parent d9e2f407e7
commit 8a991bee47


@ -13,13 +13,13 @@ a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHE
the most densely populated facilities in western Europe. He was looking for a few launching
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
[[ERITAP](https://www.eritap.com/)], took delivery in May of 2021, and this is when the internet
exchange with _Frysian roots_ was born.
[[ERITAP](https://www.eritap.com/)], took delivery of that rack in May of 2021, and this is when the
internet exchange with _Frysian roots_ was born.
In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
800Gbit of traffic or so. It's dope, and very rewarding to be a part of this community!
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!
## Frys-IX is growing
@ -27,9 +27,10 @@ We have several members with a 2x100G LAG and even though all inter-datacenter l
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're coming!
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our
way!
It became clear that we would not be able to keep a dependable peering platform if FrysIX was a
It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
@ -42,8 +43,8 @@ curve.
The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
to your data center networks and peering network environments. These devices are built around the
Broadcom _Trident_ chipset, in the case of the lefthand "D4" platform, this is a Trident4 with
28x100G and 8x400G ports.
Broadcom _Trident_ chipset, in the case of the "D4" platform, this is a Trident4 with 28x100G and
8x400G ports. Whoot!
{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
@ -52,6 +53,7 @@ What I find particularly awesome of the Trident series is their speed (total ban
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
That's a 32x100G router.
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
@ -84,13 +86,21 @@ datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
protocol is EBGP in what they call the _underlay_, and on top of that, some secondary EBGP that's
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP
needs either a 'full mesh', or external route reflectors).
Setting aside eVPN for a second, if I were to build a transport network, much like [[IPng Site
Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a simpler design:
**Critique 4**: These reference designs also make an assumption that all fiber is local and while
links can fail, it will be relatively rare to _drain_ a link. However, in cross-datacenter networks,
draining links for maintenance is very common, for example if the dark fiber provider needs to
perform maintenance. With these eBGP-over-eBGP connections, traffic engineering is more difficult
than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic.
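To make the drain argument concrete, here is a minimal Python sketch of shortest-path selection over link costs, which is the mechanism an IGP like OSPF or IS-IS relies on. The three-site topology, the site names and the costs are hypothetical and chosen only for illustration; the point is that raising the cost of one link is enough to move traffic off it.

```python
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over an undirected graph; links maps (a, b) -> cost."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical triangle: a direct span between two sites, plus a detour via a third.
links = {
    ("site-a", "site-b"): 10,
    ("site-a", "site-c"): 100,
    ("site-c", "site-b"): 100,
}
before = shortest_path(links, "site-a", "site-b")

# Drain the direct span for maintenance by raising its cost above the detour.
links[("site-a", "site-b")] = 1000
after = shortest_path(links, "site-a", "site-b")

print(before)  # direct span is preferred
print(after)   # traffic now takes the detour via site-c
```

Restoring the original cost after the maintenance window moves traffic back, with no change to any BGP session or policy.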
Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
and simple (I would even dare say elegant) design:
1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
BGP as an IGP.
@ -113,7 +123,8 @@ leave your comments below, and don't forget to like-and-subscribe :-)
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN
capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia
IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset).
IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). He wires all of this up
to look like the picture on the right.
#### Underlay: Nokia's SR Linux
@ -157,7 +168,7 @@ A:linuxadmin@nikhef# commit stay
Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
should be able to do my first ping:
should be able to do my first jumboframe ping:
```
A:linuxadmin@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
@ -187,8 +198,8 @@ A:linuxadmin@nikhef# commit stay
Similar to JunOS, I can descend into a configuration scope (the first line goes into the
_network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_
called `default`). Subsequent `set` commands operate at this scope. Once I commit this configuration
(on the nikhef router and also the equinix router, with its own router-id), OSPF shoots to life
immediately:
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF shoots
to life immediately:
```
A:linuxadmin@nikhef# show network-instance default protocols ospf neighbor
@ -230,17 +241,18 @@ Delicious! OSPF has learned the loopback, and it is now reachable. As with most
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on
the Nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF
for these), makes the whole network shoot to life.
the _nikhef_ router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF
for these), makes the whole network shoot to life. Slick!
#### Underlay: Arista
I'll point out that one of the devices in this topology is an Arista. We have several of these ready
for deployment at FrysIX. They are a lot more affordable, come with 32x100G ports, and are really
good at packet slinging because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less
features than the _Trident_ chipset, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more comfortable
configuring the whole thing here, as it's not my first time touching these devices:
for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand /
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less features than the
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more
comfortable configuring the whole thing here, as it's not my first time touching these devices:
```
arista-leaf#show run int loop0
@ -266,8 +278,8 @@ router ospf 65500
```
I complete the configuration for the other two core ports on this Arista, port Eth31/1 connects also
to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
the nokia-leaf IXR-7220-D2 with a cost of 10.
to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
the _nokia-leaf_ IXR-7220-D2 with a cost of 10.
It's nice to see OSPF in action - there are two equal (but high cost) OSPF paths via
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
(nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia ->
@ -292,13 +304,13 @@ high cost for now.
#### Overlay EVPN: SR Linux
The big-idea here is to use iBGP with the same AS number, and because there are two main facilities
(NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for
others. It means that they will have an iBGP session amongst themselves (198.19.16.0 <->
198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet.
This way, I don't have to configure any more than strictly necessary on the core routers. Any new
router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to
configure the Nokias like this:
The big-picture idea here is to use iBGP with the same AS number, and because there are two main
facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
route-reflectors for others. It means that they will have an iBGP session amongst themselves
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
routers. I proceed to configure BGP on the Nokias like this:
```
A:linuxadmin@nikhef# / network-instance default protocols bgp
A:linuxadmin@nikhef# set admin-state enable
@ -346,7 +358,7 @@ Summary:
A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router),
and two _dynamic_ peers; these are the Arista and the smaller IXR-7220-D2 router. The only address
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
learned or sent het (that's the `[0/0/0]` designation in the last column).
learned or sent yet (that's the `[0/0/0]` designation in the last column).
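The _dynamic_ peers show up because the core routers accept iBGP sessions from any source address inside the loopback range. A minimal Python sketch of that matching logic follows. The prefix and the addresses come from the text; the function name is made up for illustration, and this is not how SR Linux implements it internally.

```python
import ipaddress

# Prefix from the article; any iBGP speaker sourcing from it is accepted dynamically.
DYNAMIC_PEER_PREFIX = ipaddress.ip_network("198.19.16.0/24")

def accepts_dynamic_peer(source_ip: str) -> bool:
    """Return True if a session from source_ip falls inside the accepted range."""
    return ipaddress.ip_address(source_ip) in DYNAMIC_PEER_PREFIX

# Loopback of the nokia-leaf router: inside the range.
print(accepts_dynamic_peer("198.19.16.3"))
# A point-to-point transit address: outside the range.
print(accepts_dynamic_peer("198.19.17.1"))
# A VTEP address: outside the range.
print(accepts_dynamic_peer("198.19.18.3"))
```

This is why a new router only has to source its iBGP sessions from its loopback: as long as that loopback sits in 198.19.16.0/24, the core routers need no per-neighbor configuration.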
#### Overlay EVPN: Arista
@ -382,11 +394,14 @@ Neighbor AS Session State AFI/SAFI AFI/SAFI State N
On this leaf node, I'll have a redundant iBGP session with the two core nodes. Since those core
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both. Voila!
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
Voila!
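A quick back-of-the-envelope comparison shows why the two route-reflectors pay off as the network grows. This is a small Python sketch under the assumption that every non-reflector router peers with both reflectors and that the reflectors peer with each other; the function names are hypothetical.

```python
def full_mesh_sessions(routers: int) -> int:
    """iBGP sessions needed when every router peers with every other router."""
    return routers * (routers - 1) // 2

def route_reflector_sessions(routers: int, reflectors: int = 2) -> int:
    """Sessions needed when clients peer only with the reflectors."""
    clients = routers - reflectors
    # Reflectors are meshed amongst themselves; each client peers with every reflector.
    return full_mesh_sessions(reflectors) + clients * reflectors

for count in (4, 10, 20):
    print(count, full_mesh_sessions(count), route_reflector_sessions(count))
```

For the four routers in this lab the difference is small (6 versus 5 sessions), but at 20 routers a full mesh needs 190 sessions while the reflector design needs 37, and adding a router only ever adds two more.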
#### VXLAN EVPN: SR Linux
Nokia informs me that it uses a special interface called _system0_ to source its VXLAN traffic from.
So it's a matter of defining that interface and associating a VXLAN interface with it, like so:
Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
VXLAN traffic from, and to add that interface to the _default_ network-instance. So it's a matter of
defining that interface and associating a VXLAN interface with it, like so:
```
A:linuxadmin@nikhef# set / interface system0 admin-state enable
@ -430,8 +445,8 @@ A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt
A:linuxadmin@nikhef# commit stay
```
In the first block here, I take what is a 100G port called `ethernet-1/9` and I split it into 4x25G
ports. I'll force the port speed to 10G because Arend has taken a 40G-4x10G DAC, and it happens that
In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G
ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
untagged traffic to enter it.
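The naming that results from the breakout can be sketched in a few lines of Python. The names `ethernet-1/9/3` and `ethernet-1/9/3.0` are taken from the text; that the four lanes are numbered 1 through 4 is an assumption inferred from the third lane being `/3`, and the helper functions are hypothetical.

```python
def breakout_ports(parent: str, lanes: int = 4) -> list[str]:
    """Names of the lane ports created when a parent port is split (assumed 1-based)."""
    return [f"{parent}/{lane}" for lane in range(1, lanes + 1)]

def subinterface(port: str, index: int = 0) -> str:
    """Name of a sub-interface on a port, written as port.index."""
    return f"{port}.{index}"

ports = breakout_ports("ethernet-1/9")
print(ports)
# The Debian machine sits on the third lane; the untagged sub-interface is index 0.
print(subinterface(ports[2]))
```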
@ -444,13 +459,13 @@ previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates
the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a
the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a
protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the
VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and
import/export route-targets. For simplicity I've just used the same for each: 65500:2604.
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
routers: `ethernet-1/9/3.0` on the equinix router and `ethernet-1/9.0` on the nokia-leaf router.
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
Each of these goes to a 10Gbps port on a Debian machine.
#### VXLAN EVPN: Arista
@ -476,7 +491,8 @@ interface Vxlan1
After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias.
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
previously.
Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
sessions:
@ -573,11 +589,11 @@ AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li
* ec RD: 65500:2604 imet 198.19.18.3
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
```
There's a lot to unpack here! The Arista is seeing that from the _route-discriminator_ I configured
There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured
on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for
the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator
the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on
iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix).
iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).
I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor
@ -643,10 +659,10 @@ entries. One thing to note -- the SR Linux implementation leaves the type-2 rout
#### Results: Debian view
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. Arend
hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections.
This network card is a regular in my AS8298 network, as it has excellent DPDK support and can pump
easily 40Mpps with VPP. IPng 🥰 Intel X710!
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
and can pump easily 40Mpps with VPP. IPng 🥰 Intel X710!
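To put the 40Mpps figure in perspective, here is a small Python calculation of the theoretical packet rate of the four 10G ports at minimum frame size. It assumes standard Ethernet framing: a 64-byte minimum frame plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap on the wire. The function name is made up for illustration.

```python
WIRE_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap

def line_rate_pps(bits_per_second: int, frame_bytes: int = 64) -> float:
    """Maximum frames per second on an Ethernet link for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return bits_per_second / bits_on_wire

per_port = line_rate_pps(10_000_000_000)
card_total = 4 * per_port

print(f"{per_port / 1e6:.2f} Mpps per 10G port")
print(f"{card_total / 1e6:.2f} Mpps for 4x10G")
print(f"{40e6 / card_total:.0%} of 64-byte line rate at 40 Mpps")
```

At 64 bytes per frame, each 10G port tops out at about 14.88 Mpps, so 40Mpps is roughly two thirds of what the four ports could carry in the worst case.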
```
root@debian:~ # ip netns add nikhef
@ -755,3 +771,12 @@ as well as SR Linux, and in particular wanted to give a big "Thank you!" for hel
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
gold!
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
OOB access, and for brainstorming the config with me!