A few typo and readability fixes
All checks were successful
continuous-integration/drone/push Build is passing
@@ -13,13 +13,13 @@ a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHE
 the most densely populated facilities in western Europe. He was looking for a few launching
 customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
 my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
-[[ERITAP](https://www.eritap.com/)], took delivery in May of 2021, and this is when the internet
-exchange with _Frysian roots_ was born.
+[[ERITAP](https://www.eritap.com/)], took delivery of that rack in May of 2021, and this is when the
+internet exchange with _Frysian roots_ was born.
 
 In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
 success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
 with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
-800Gbit of traffic or so. It's dope, and very rewarding to be a part of this community!
+800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!
 
 ## Frys-IX is growing
 
@@ -27,9 +27,10 @@ We have several members with a 2x100G LAG and even though all inter-datacenter l
 fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
 You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
 traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
-the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're coming!
+the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our
+way!
 
-It became clear that we would not be able to keep a dependable peering platform if FrysIX was a
+It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
 single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
 operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
 balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
@@ -42,8 +43,8 @@ curve.
 The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
 high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
 to your data center networks and peering network environments. These devices are built around the
-Broadcom _Trident_ chipset, in the case of the lefthand "D4" platform, this is a Trident4 with
-28x100G and 8x400G ports.
+Broadcom _Trident_ chipset, in the case of the "D4" platform, this is a Trident4 with 28x100G and
+8x400G ports. Whoot!
 
 {{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
 
@@ -52,6 +53,7 @@ What I find particularly awesome of the Trident series is their speed (total ban
 a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
 approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
 2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
+That's a 32x100G router.
 
 ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
 IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
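As a quick sanity check on the port counts quoted in the hunk above, here is a minimal Python sketch that sums the nominal front-panel capacity per platform. The port layouts are taken from the text. The helper names `front_panel_gbit` and `fleet_gbit` are made up for illustration, and the inventory (one gifted D3, two D4, two D2) is my reading of the paragraph, not a confirmed list.

```python
# Port layouts as quoted in the text: (port count, speed in Gbit/s) per platform.
PLATFORMS = {
    "IXR-7220-D2": [(48, 25), (8, 100)],
    "IXR-7220-D3": [(32, 100)],
    "IXR-7220-D4": [(28, 100), (8, 400)],
}

# Assumed inventory, inferred from the text: one gifted D3, two D4 and two D2.
INVENTORY = {"IXR-7220-D3": 1, "IXR-7220-D4": 2, "IXR-7220-D2": 2}


def front_panel_gbit(model: str) -> int:
    """Nominal sum of all port speeds of one router, in Gbit/s."""
    return sum(count * speed for count, speed in PLATFORMS[model])


def fleet_gbit(inventory: dict) -> int:
    """Nominal sum of all port speeds over the whole inventory, in Gbit/s."""
    return sum(units * front_panel_gbit(model) for model, units in inventory.items())


if __name__ == "__main__":
    for model in sorted(PLATFORMS):
        print(f"{model}: {front_panel_gbit(model)} Gbit/s")
    print(f"{sum(INVENTORY.values())} routers: {fleet_gbit(INVENTORY)} Gbit/s")
```

These are nominal port-speed sums only; they say nothing about what the switch chip can actually forward.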
@@ -84,13 +86,21 @@ datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to
 It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
 
 **Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
-protocol is EBGP in what they call the _underlay_, and on top of that, some secondary EBGP that's
+protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
 called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
 hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP
 needs either a 'full mesh', or external route reflectors).
 
-Setting aside eVPN for a second, if I were to build a transport network, much like [[IPng Site
-Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a simpler design:
+**Critique 4**: These reference designs also make an assumption that all fiber is local and while
+links can fail, it will be relatively rare to _drain_ a link. However, in cross-datacenter networks,
+draining links for maintenance is very common, for example if the dark fiber provider needs to
+perform maintenance. With these eBGP-over-eBGP connections, traffic engineering is more difficult
+than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic.
+
+Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
+[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
+and simple (I would even dare say elegant) design:
+
 1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
 perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
 BGP as an IGP.
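To put a number on the 'full mesh' remark in Critique 3, here is a small Python sketch that compares the iBGP session count of a full mesh against a design with route reflectors. The function names are hypothetical, and the default of two reflectors is an assumption chosen for illustration.

```python
def full_mesh_sessions(routers: int) -> int:
    """Every router peers with every other router: n * (n - 1) / 2 sessions."""
    if routers < 0:
        raise ValueError("router count must not be negative")
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int = 2) -> int:
    """Reflectors mesh among themselves; every other router peers with each reflector."""
    if reflectors < 1 or routers < reflectors:
        raise ValueError("need at least one reflector and routers >= reflectors")
    clients = routers - reflectors
    return full_mesh_sessions(reflectors) + clients * reflectors


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, full_mesh_sessions(n), route_reflector_sessions(n))
```

With two reflectors, each additional router costs two sessions, while in a full mesh each additional router has to peer with every router already present.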
@@ -113,7 +123,8 @@ leave your comments below, and don't forget to like-and-subscribe :-)
 Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
 400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN
 capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia
-IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset).
+IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). He wires all of this up
+to look like the picture on the right.
 
 #### Underlay: Nokia's SR Linux
 
@@ -157,7 +168,7 @@ A:linuxadmin@nikhef# commit stay
 
 Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
 loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
-should be able to do my first ping:
+should be able to do my first jumboframe ping:
 
 ```
 A:linuxadmin@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
@@ -187,8 +198,8 @@ A:linuxadmin@nikhef# commit stay
 Similar to in JunOS, I can descend into a configuration scope (the first line goes into the
 _network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_
 called `default`). Subsequent `set` commands operate at this scope. Once I commit this configuration
-(on the nikhef router and also the equinix router, with its own router-id), OSPF shoots to life
-immediately:
+(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF shoots
+to life immediately:
 
 ```
 A:linuxadmin@nikhef# show network-instance default protocols ospf neighbor
@@ -230,17 +241,18 @@ Delicious! OSPF has learned the loopback, and it is now reachable. As with most
 to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going
 from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
 going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on
-the Nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF
-for these), makes the whole network shoot to life.
+the _nikhef_ router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF
+for these), makes the whole network shoot to life. Slick!
 
 #### Underlay: Arista
 
 I'll point out that one of the devices in this topology is an Arista. We have several of these ready
-for deployment at FrysIX. They are a lot more affordable, come with 32x100G ports, and are really
-good at packet slinging because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less
-features than the _Trident_ chipset, but they happen to have all the features we need to run our
-internet exchange. So I turn my attention to the Arista in the topology. I am much more comfortable
-configuring the whole thing here, as it's not my first time touching these devices:
+for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand /
+refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
+because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less features than the
+_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
+internet exchange. So I turn my attention to the Arista in the topology. I am much more
+comfortable configuring the whole thing here, as it's not my first time touching these devices:
 
 ```
 arista-leaf#show run int loop0
@@ -266,8 +278,8 @@ router ospf 65500
 ```
 
 I complete the configuration for the other two core ports on this Arista, port Eth31/1 connects also
-to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
-the nokia-leaf IXR-7220-D2 with a cost of 10.
+to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
+the _nokia-leaf_ IXR-7220-D2 with a cost of 10.
 It's nice to see OSPF in action - there are two equal-cost (but high cost) OSPF paths via
 router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
 (nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia ->
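The effect of these OSPF costs can be sketched with a few lines of Python running Dijkstra over a toy version of the lab. This is an illustration under stated assumptions, not how either NOS computes its SPF tree: links are modelled as symmetric, the two costs on the Arista ports (1000 and 10) come from the text, and the cost of 10 between _nokia-leaf_ and _nikhef_ is an assumption, because that value is not visible in this hunk.

```python
import heapq

# (router, router) -> cost. Links are treated as symmetric for simplicity.
LINKS = {
    ("arista-leaf", "nikhef"): 1000,     # from the text: high-cost direct port
    ("arista-leaf", "nokia-leaf"): 10,   # from the text: low-cost 1x100G port
    ("nokia-leaf", "nikhef"): 10,        # assumed value, not stated in this hunk
}


def shortest_path(links, src, dst):
    """Return (total cost, list of hops) of the cheapest path from src to dst."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for peer, link_cost in graph.get(node, []):
            if peer not in visited:
                heapq.heappush(queue, (cost + link_cost, peer, path + [peer]))
    raise ValueError(f"no path from {src} to {dst}")


if __name__ == "__main__":
    print(shortest_path(LINKS, "arista-leaf", "nikhef"))
    # Drain the cheap link for maintenance by raising its cost.
    drained = dict(LINKS)
    drained[("arista-leaf", "nokia-leaf")] = 10000
    print(shortest_path(drained, "arista-leaf", "nikhef"))
```

Under these assumed costs the detour via _nokia-leaf_ wins until its link cost is raised, at which point traffic moves to the direct port without touching any BGP configuration.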
@@ -292,13 +304,13 @@ high cost for now.
 
 #### Overlay EVPN: SR Linux
 
-The big-idea here is to use iBGP with the same AS number, and because there are two main facilities
-(NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for
-others. It means that they will have an iBGP session amongst themselves (198.19.16.0 <->
-198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet.
-This way, I don't have to configure any more than strictly necessary on the core routers. Any new
-router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to
-configure the Nokias like this:
+The big-picture idea here is to use iBGP with the same AS number, and because there are two main
+facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
+route-reflectors for others. It means that they will have an iBGP session amongst themselves
+(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
+198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
+core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
+routers. I proceed to configure BGP on the Nokias like this:
 ```
 A:linuxadmin@nikhef# / network-instance default protocols bgp
 A:linuxadmin@nikhef# set admin-state enable
@@ -346,7 +358,7 @@ Summary:
 A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router),
 and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address
 family that they are exchanging information for is the _evpn_ family, and no prefixes have been
-learned or sent het (that's the `[0/0/0]` designation in the last column).
+learned or sent yet (that's the `[0/0/0]` designation in the last column).
 
 #### Overlay EVPN: Arista
 
@@ -382,11 +394,14 @@ Neighbor AS Session State AFI/SAFI AFI/SAFI State N
 On this leaf node, I'll have a redundant iBGP session with the two core nodes. Since those core
 nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
 matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
-enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both. Voila!
+enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
+Voila!
 
 #### VXLAN EVPN: SR Linux
-Nokia informs me that it uses a special interface called _system0_ to source its VXLAN traffic from.
-So it's a matter of defining that interface and associating a VXLAN interface with it, like so:
+
+Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
+VXLAN traffic from, and to add the interface to the _default_ network-instance. So it's a matter of
+defining that interface and associating a VXLAN interface with it, like so:
 
 ```
 A:linuxadmin@nikhef# set / interface system0 admin-state enable
@@ -430,8 +445,8 @@ A:linuxadmin@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt
 A:linuxadmin@nikhef# commit stay
 ```
 
-In the first block here, I take what is a 100G port called `ethernet-1/9` and I split it into 4x25G
-ports. I'll force the port speed to 10G because Arend has taken a 40G-4x10G DAC, and it happens that
+In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G
+ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that
 the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
 sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
 untagged traffic to enter it.
@@ -444,13 +459,13 @@ previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred
 _VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
 
 The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates
-the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a
+the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a
 protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the
 VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and
 import/export route-targets. For simplicity I've just used the same for each: 65500:2604.
 
 I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
-routers: `ethernet-1/9/3.0` on the equinix router and `ethernet-1/9.0` on the nokia-leaf router.
+routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
 Each of these goes to a 10Gbps port on a Debian machine.
 
 #### VXLAN EVPN: Arista
@@ -476,7 +491,8 @@ interface Vxlan1
 
 After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
 called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
-local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias.
+local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
+previously.
 
 Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
 sessions:
@@ -573,11 +589,11 @@ AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li
 * ec RD: 65500:2604 imet 198.19.18.3
 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
 ```
-There's a lot to unpack here! The Arista is seeing that from the _route-discriminator_ I configured
+There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured
 on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for
-the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator
+the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
 198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on
-iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix).
+iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).
 
 I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
 a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor
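For readers who want to pull the route-reflection attributes out of output like this, here is a minimal Python sketch. It assumes the attributes appear exactly as in the single line visible in the hunk above (`Or-ID: <address> C-LST: <address> ...`); the regular expression and the function name `parse_reflection_attrs` are my own and are not part of any Arista tooling.

```python
import re
from ipaddress import IPv4Address

# Matches the tail of a path line such as:
#   198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
ATTR_RE = re.compile(r"Or-ID:\s*(?P<originator>\S+)\s+C-LST:\s*(?P<clusters>.+)$")


def parse_reflection_attrs(line: str):
    """Return (originator, [cluster ids]) or None if the line carries no such attributes."""
    match = ATTR_RE.search(line.strip())
    if match is None:
        return None
    originator = IPv4Address(match["originator"])
    clusters = [IPv4Address(item) for item in match["clusters"].split()]
    return originator, clusters


if __name__ == "__main__":
    sample = "198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0"
    print(parse_reflection_attrs(sample))
```

The originator identifies the router that injected the route (here the loopback of _nokia-leaf_), and the cluster list shows which reflector passed it along.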
@@ -643,10 +659,10 @@ entries. One thing to note -- the SR Linux implementation leaves the type-2 rout
 
 #### Results: Debian view
 
-There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. Arend
-hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections.
-This network card is a regular in my AS8298 network, as it has excellent DPDK support and can pump
-easily 40Mpps with VPP. IPng 🥰 Intel X710!
+There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
+Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
+connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
+and can pump easily 40Mpps with VPP. IPng 🥰 Intel X710!
 
 ```
 root@debian:~ # ip netns add nikhef
@@ -755,3 +771,12 @@ as well as SR Linux, and in particular wanted to give a big "Thank you!" for hel
 symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
 Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
 gold!
+
+I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
+answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
+Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
+Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
+
+Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
+Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
+OOB access, and for brainstorming the config with me!