@ -76,7 +76,7 @@ notably they don't have to do provider edge functionality like VXLAN encap and d
|
||||
Almost all of these designs are showing how one might build a leaf-spine network for hyperscale.
|
||||
|
||||
**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
|
||||
in the picture above I have these beautiful Nokia IXR-7220-D4 switches, using two 400G ports to
|
||||
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
|
||||
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
|
||||
would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact.
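To put a number on that idle capacity, here is a small sketch of the arithmetic. It only uses the port counts quoted above (two 400G and six 100G in use, 22x 100G and 6x 400G left over); the variable and function names are made up for illustration.

```python
# Port counts taken from the text: {speed in Gbps: number of ports}.
USED_PORTS = {400: 2, 100: 6}
UNUSED_PORTS = {400: 6, 100: 22}


def total_gbps(ports):
    """Sum the nominal bandwidth of a {speed: count} port inventory."""
    return sum(speed * count for speed, count in ports.items())


used_gbps = total_gbps(USED_PORTS)
unused_gbps = total_gbps(UNUSED_PORTS)
idle_fraction = unused_gbps / (used_gbps + unused_gbps)

print(f"used: {used_gbps}G, unused: {unused_gbps}G, idle: {idle_fraction:.1%}")
```

Under these numbers, roughly three quarters of the front-panel bandwidth would sit idle if the router only acted as a spine.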
|
||||
|
||||
@ -88,14 +88,15 @@ It's much more economical to create a star-topology that minimizes cross-datacen
|
||||
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
|
||||
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
|
||||
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
|
||||
hundreds of switches, in which case making use of one AS number per switch could make sense (as iBGP
|
||||
needs either a 'full mesh', or external route reflectors).
|
||||
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
|
||||
needs either a 'full mesh', or external route reflectors.
|
||||
|
||||
**Critique 4**: These reference designs also make an assumption that all fiber is local and while
|
||||
links can fail, it will be relatively rare to _drain_ a link. However, in cross-datacenter networks,
|
||||
draining links for maintenance is very common, for example if the dark fiber provider needs to
|
||||
perform maintenance. With these eBGP-over-eBGP connections, traffic engineering is more difficult
|
||||
than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic.
|
||||
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
|
||||
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
|
||||
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
|
||||
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
|
||||
link, to reroute traffic.
|
||||
|
||||
Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
|
||||
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
|
||||
@ -121,16 +122,16 @@ for the overlay! I have a feeling that some folks will dispise me for being cont
|
||||
leave your comments below, and don't forget to like-and-subscribe :-)
|
||||
|
||||
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
|
||||
400G-capable switches and connects them. Then he takes an Arista DCS-7060CX switch (which is eVPN
|
||||
capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset), and a smaller Nokia
|
||||
IXR-7220-D2 (with 48x25G and 8x100G ports, based on the Trident3 chipset). He wires all of this up
|
||||
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
|
||||
capable, with 32x100G ports, based on the Broadcom Tomahawk3 chipset, and a smaller Nokia
|
||||
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
|
||||
to look like the picture on the right.
|
||||
|
||||
#### Underlay: Nokia's SR Linux
|
||||
|
||||
We boot up the lab, verify that all the optics and links are up, and connect the management ports to
|
||||
an OOB network that I can remotely log in to. This is the first time that either of us work on
|
||||
Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
|
||||
We boot up the equipment, verify that all the optics and links are up, and connect the management
|
||||
ports to an OOB network that I can remotely log in to. This is the first time that either of us has worked
|
||||
on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
|
||||
|
||||
```
|
||||
[pim@nikhef ~]$ sr_cli
|
||||
@ -181,7 +182,7 @@ PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
|
||||
|
||||
#### Underlay: SR Linux OSPF
|
||||
|
||||
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each others' loopbacks.
|
||||
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
|
||||
It's really easy:
|
||||
|
||||
```
|
||||
@ -195,11 +196,11 @@ A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
Similar to in JunOS, I can descend into a configuration scope (the first line goes into the
|
||||
Similar to JunOS, I can descend into a configuration scope: the first line goes into the
|
||||
_network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_
|
||||
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
|
||||
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF shoots
|
||||
to life immediately:
|
||||
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF quickly
|
||||
springs into action:
|
||||
|
||||
```
|
||||
A:pim@nikhef# show network-instance default protocols ospf neighbor
|
||||
@ -241,8 +242,8 @@ Delicious! OSPF has learned the loopback, and it is now reachable. As with most
|
||||
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going
|
||||
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
|
||||
going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on
|
||||
the _nikhef_ router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF
|
||||
for these), makes the whole network shoot to life. Slick!
|
||||
the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and
|
||||
turning on OSPF for these), makes the whole network shoot to life. Slick!
|
||||
|
||||
#### Underlay: Arista
|
||||
|
||||
@ -277,13 +278,14 @@ router ospf 65500
|
||||
max-lsa 12000
|
||||
```
|
||||
|
||||
I complete the configuration for the other two core ports on this Arista, port Eth31/1 connects also
|
||||
I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects
|
||||
to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
|
||||
the _nokia-leaf_ IXR-7220-D2 with a cost of 10.
|
||||
It's nice to see OSPF in action: there are two equal (but high) cost OSPF paths via
|
||||
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
|
||||
(nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia ->
|
||||
equinix).
|
||||
(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nokia ->
|
||||
equinix). Dope!
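To illustrate why the scenic route wins, here is a minimal shortest-path sketch. Only the cost of 1000 (arista-leaf to nikhef) and 10 (arista-leaf to nokia-leaf) come from the text; the cost of 10 on the nokia-leaf to nikhef and nikhef to equinix links is an assumption for illustration, and the parallel Eth31/1 and Eth32/1 links are collapsed into a single edge.

```python
import heapq

# Link costs: 1000 and 10 from arista-leaf are from the text,
# the other costs of 10 are assumed for this sketch.
GRAPH = {
    "arista-leaf": {"nikhef": 1000, "nokia-leaf": 10},
    "nokia-leaf": {"arista-leaf": 10, "nikhef": 10},
    "nikhef": {"arista-leaf": 1000, "nokia-leaf": 10, "equinix": 10},
    "equinix": {"nikhef": 10},
}


def shortest_path(graph, src, dst):
    """Plain Dijkstra: return (total cost, list of hops) from src to dst."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


cost, path = shortest_path(GRAPH, "arista-leaf", "equinix")
print(cost, " -> ".join(path))
```

With these assumed costs, the path via _nokia-leaf_ totals 30 while the direct link to _nikhef_ totals 1010, which matches the traceroute. It also shows why draining a link is just a matter of raising its cost.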
|
||||
|
||||
```
|
||||
arista-leaf#show ip ospf nei
|
||||
Neighbor ID Instance VRF Pri State Dead Time Address Interface
|
||||
@ -304,13 +306,14 @@ high cost for now.
|
||||
|
||||
#### Overlay EVPN: SR Linux
|
||||
|
||||
The big-picture idea here is to use iBGP with the same AS number, and because there are two main
|
||||
facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
|
||||
The big-picture idea here is to use iBGP with the same private AS number, and because there are two
|
||||
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
|
||||
route-reflectors for others. It means that they will have an iBGP session amongst themselves
|
||||
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
|
||||
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
|
||||
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
|
||||
routers. I proceed to configure BGP on the Nokias like this:
|
||||
|
||||
```
|
||||
A:pim@nikhef# / network-instance default protocols bgp
|
||||
A:pim@nikhef# set admin-state enable
|
||||
@ -358,7 +361,7 @@ Summary:
|
||||
A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router),
|
||||
and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address
|
||||
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
|
||||
learned or sent yet (that's the `[0/0/0]` designation in the last column).
|
||||
learned or sent yet, shown by the `[0/0/0]` designation in the last column.
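The dynamic peering rule boils down to a prefix match on the source address of the incoming session. As a rough sketch (this is not how SR Linux implements it, and the function name is hypothetical), the check against the 198.19.16.0/24 loopback range looks like this:

```python
import ipaddress

# The range from which the route reflectors accept dynamic iBGP peers.
DYNAMIC_PEER_RANGE = ipaddress.ip_network("198.19.16.0/24")


def accepts_dynamic_peer(source_ip: str) -> bool:
    """Return True if a session sourced from source_ip falls in the accepted range."""
    return ipaddress.ip_address(source_ip) in DYNAMIC_PEER_RANGE


# Addresses mentioned in this article.
candidates = {
    "equinix loopback": "198.19.16.0",
    "nikhef loopback": "198.19.16.1",
    "nokia-leaf loopback": "198.19.16.3",
    "nokia-leaf VTEP": "198.19.18.3",
}

for name, ip in candidates.items():
    print(f"{name:20s} {ip:12s} accepted={accepts_dynamic_peer(ip)}")
```

The loopbacks match, while the VTEP address in 198.19.18.0/24 does not, so a leaf has to source its iBGP session from its loopback to be accepted.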
|
||||
|
||||
#### Overlay EVPN: Arista
|
||||
|
||||
@ -400,7 +403,7 @@ Voila!
|
||||
#### VXLAN EVPN: SR Linux
|
||||
|
||||
Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
|
||||
VXLAN traffic from, and add the interface to the _default_ network-instance. So it's a matter of
|
||||
VXLAN traffic from, and to add this interface to the _default_ network-instance. So it's a matter of
|
||||
defining that interface and associating a VXLAN interface with it, like so:
|
||||
|
||||
```
|
||||
@ -459,10 +462,11 @@ previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred
|
||||
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
|
||||
|
||||
The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates
|
||||
the newly crated untagged sub-interface `ethernet-1/9/3.0` with with the VXLAN interface, and starts a
|
||||
the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a
|
||||
protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the
|
||||
VXLAN interface, and signalling of all MAC addresses learned to use route-distinguisher and
|
||||
import/export route-targets. For simplicity I've just used the same for each: 65500:2604.
|
||||
VXLAN sub-interface, and signalling of all MAC addresses learned to use the specified
|
||||
route-distinguisher and import/export route-targets. For simplicity I've just used the same for
|
||||
each: 65500:2604.
|
||||
|
||||
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
|
||||
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
|
||||
@ -592,7 +596,7 @@ AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li
|
||||
There's a lot to unpack here! The Arista is seeing that from the _route-distinguisher_ I configured
|
||||
on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for
|
||||
the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
|
||||
198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the _active_ one on
|
||||
198.19.16.3 (the loopback of the _nokia-leaf_ router), from two cluster members, the active one on
|
||||
iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).
|
||||
|
||||
I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
|
||||
@ -650,9 +654,9 @@ Type 3 Inclusive Multicast Ethernet Tag Routes
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
I have to say, SR Linux is incredibly chatty! But, I can see all the relevant bits and bobs here.
|
||||
Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch, one
|
||||
pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the IMET
|
||||
I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs
|
||||
here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch,
|
||||
one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet`
|
||||
entries. One thing to note -- the SR Linux implementation leaves the type-2 routes empty with a
|
||||
0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL
|
||||
(unspecified). But, everything looks great!
|
||||
@ -713,14 +717,14 @@ fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
```
|
||||
|
||||
The Debian machine puts each network card into its own network namespace, and gives it both an IPv4
|
||||
The Debian machine puts each network card into its own network namespace, and gives them both an IPv4
|
||||
and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to
|
||||
the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the
|
||||
`arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4
|
||||
neighbor table on the network card that is connected to the Equinix router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability.
|
||||
neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
|
||||
|
||||
Performance? We got that!
|
||||
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
|
||||
```
|
||||
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
|
||||
Connecting to host 192.0.2.12, port 5201
|
||||
@ -757,9 +761,9 @@ Notably:
|
||||
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
|
||||
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
|
||||
BUM flooding.
|
||||
* Andy informs me that Arista also has this feature. By setting 'router l2-vpn' and 'arp learning bridged',
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross
|
||||
router BUM flooding. If DE-CIX can do it, so can FrysIX :)
|
||||
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`,
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router
|
||||
BUM flooding. If DE-CIX can do it, so can FrysIX :)
|
||||
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
|
||||
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
|
||||
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
|
||||