---
date: "2023-11-11T13:31:00Z"
title: Debian on Mellanox SN2700 (32x100G)
aliases:
- /s/articles/2023/11/11/mellanox-sn2700.html
---

# Introduction

I'm still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic,
and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these
network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it's a
bit more convenient to connect them all to a switch. However, for this to work I would need (at
least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real
quick.

Or do they?

## Hardware

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-1.png" alt="SN2700" >}}

I thought I'd ask the #nlnog IRC channel for advice, and of course the usual suspects came past,
such as Juniper, Arista, and Cisco. But somebody mentioned "How about Mellanox, like SN2700?" and I
remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and
I found one for EUR 1'400,- for 32x100G which felt suspiciously low priced... but I thought YOLO and
I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.

The switch specs are pretty impressive, with 32x100G QSFP28 ports, which can be broken out to a set
of sub-ports (each of 1/10/25/50G), with a specified switch throughput of 6.4Tbps and 4.76Gpps,
while only consuming ~150W all-up.

Further digging revealed that the architecture of this switch consists of two main parts:

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-2.png" alt="SN2700" >}}

1. an AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB
   and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the
   front of the chassis; furthermore it's equipped with 8GB of RAM in an SO-DIMM, and its CPU is a
   two-core Celeron(R) CPU 1047UE @ 1.40GHz.

1. the silicon used in this switch is called _Spectrum_ and identifies itself in Linux as PCI
   device `03:00.0` called _Mellanox Technologies MT52100_, so the front dataplane with 32x100G is
   separated from the Linux-based controlplane.

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-3.png" alt="SN2700" >}}

When turning on the device, the serial port comes to life and shows me a BIOS, quickly after which
it jumps into GRUB2 and wants me to install it using something called ONIE. I've heard of that, but
now it's time for me to learn a little bit more about that stuff. I ask around and there's plenty of
ONIE images for this particular type of chip to be found - some are open source, some are semi-open
source (as in: were once available but are now behind paywalls, etc).

Before messing around with the switch and possibly locking myself out or bricking it, I take out the
16GB mSATA and make a copy of it for safekeeping. I feel somewhat invincible by doing this. How badly
could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I'm about
to find out, so read on!

## Software

The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that
supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and
the capabilities of the Mellanox Spectrum® ASIC. The SN2700 has three modes of operation:

* Preinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system
  utilizing common networking user experiences and industry standard CLI.
* Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience
  from servers to switches and providing a rich routing functionality for large scale applications.
* Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based
  operating systems.

I asked around a bit more and found that there's a few more things one might do with this switch.
One of them is [[SONiC](https://github.com/sonic-net/SONiC)], which stands for _Software for Open
Networking in the Cloud_, and has support for the _Spectrum_ and notably the _SN2700_ switch. Cool!

I also learned about [[DENT](https://dent.dev)], which utilizes the Linux Kernel, Switchdev,
and other Linux based projects as the basis for building a new standardized network operating system
without abstractions or overhead. Unfortunately, while the _Spectrum_ chipset is known to DENT, this
particular layout on the SN2700 is not supported.

Finally, my buddy `fall0ut` said "why not just Debian with switchdev?" and now my eyes opened wide.
I had not yet come across [[switchdev](https://docs.kernel.org/networking/switchdev.html)], which is
a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane
from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation
in the [[linux kernel](https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/mellanox/mlxsw)]
for the Spectrum series of silicon, and it's all upstreamed to the Linux kernel. Wait, what?!

### Mellanox Switchdev

I start by reading the [[brochure](/assets/mlxsw/PB_Spectrum_Linux_Switch.pdf)], which shows me the
intentions Mellanox had when designing and marketing these switches. It seems that they really meant
it when they said this thing is a fully customizable Linux switch, check out this paragraph:

> Once the Mellanox Switchdev driver is loaded into the Linux Kernel, each
> of the switch’s physical ports is registered as a net_device within the
> kernel. Using standard Linux tools (for example, bridge, tc, iproute), ports
> can be bridged, bonded, tunneled, divided into VLANs, configured for L3
> routing and more. Linux switching and routing tables are reflected in the
> switch hardware. Network traffic is then handled directly by the switch.
> Standard Linux networking applications can be natively deployed and
> run on switchdev. This may include open source routing protocol stacks,
> such as Quagga, Bird and XORP, OpenFlow applications, or user-specific
> implementations.

### Installing Debian on SN2700

.. they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I
had laying around. After installing it, I noticed that the coveted `mlxsw` driver is not shipped by
default on the Linux kernel image in Debian, so I decide to build my own, letting the [[Debian
docs](https://wiki.debian.org/BuildADebianKernelPackage)] take my hand and guide me through it.

I find a reference on the Mellanox [[GitHub
wiki](https://github.com/Mellanox/mlxsw/wiki/Installing-a-New-Kernel)] which shows me which kernel
modules to include to successfully use the _Spectrum_ under Linux, so I think I know what to do:

```
pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \
  libncurses5-dev libelf-dev libssl-dev dwarves bison
pim@summer:/usr/src$ sudo apt install linux-source-6.1
pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz
pim@summer:/usr/src$ cd linux-source-6.1/
pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config
pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_IPV6_GRE=m
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_DEVLINK=y
CONFIG_MLXFW=m
CONFIG_MLXSW_CORE=m
CONFIG_MLXSW_CORE_HWMON=y
CONFIG_MLXSW_CORE_THERMAL=y
CONFIG_MLXSW_PCI=m
CONFIG_MLXSW_I2C=m
CONFIG_MLXSW_MINIMAL=y
CONFIG_MLXSW_SWITCHX2=m
CONFIG_MLXSW_SPECTRUM=m
CONFIG_MLXSW_SPECTRUM_DCB=y
CONFIG_LEDS_MLXCPLD=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_CLS_FLOWER=m
CONFIG_NET_ACT_GACT=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_SAMPLE=m
CONFIG_NET_ACT_VLAN=m
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_NET_VRF=m
EOF
pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig
pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg
```

I run a gratuitous `make menuconfig` after adding all those config statements to the end of the
`.config` file, and it figures out how to merge what I appended with what was already there. I used
the standard Bookworm 6.1 kernel config that came from the default installer, so that the result
would be a minimal diff against what Debian itself shipped with.

After Summer stretches her legs a bit compiling this kernel for me, look at the result:
```
pim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw
drwxr-xr-x root/root       0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/
-rw-r--r-- root/root  414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko
-rw-r--r-- root/root   19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko
-rw-r--r-- root/root   31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko
-rw-r--r-- root/root   65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko
-rw-r--r-- root/root 1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko
```
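To boot from this kernel, the freshly built package still needs to be installed onto the mSATA disk
that will go into the switch. A minimal sketch, assuming the package name from the build above:

```
pim@summer:/usr/src$ sudo dpkg -i linux-image-6.1.55_6.1.55-4_amd64.deb
```
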
Good job, Summer! On my mSATA disk, I tell Linux to boot its kernel using the following in GRUB,
which will make the kernel not create spiffy interface names like `enp6s0` or `eno1` but just
enumerate them all one by one and call them `eth0` and so on:
```
pim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0"
```
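After editing `/etc/default/grub`, the change still needs to be rendered into the actual GRUB
configuration and the machine rebooted; a minimal sketch, assuming the standard Debian tooling:

```
pim@fafo:~$ sudo update-grub
pim@fafo:~$ sudo reboot
```
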
## Mellanox SN2700 running Debian+Switchdev

{{< image width="400px" float="right" src="/assets/mlxsw/debian.png" alt="Debian" >}}

I insert the freshly installed Debian Bookworm with the custom-compiled 6.1.55+mlxsw kernel into the
switch, and it boots on the first try. I see 34 (!) ethernet ports: the first two come from an Intel
NIC but carry a MAC address from Mellanox (starting with `0c:42:a1`), and the other 32 share a
common MAC address range (from Mellanox, starting with `04:3f:72`). I also notice that the MAC
addresses skip one between subsequent ports, which leads me to believe that these 100G ports can be
split into two (perhaps 2x50G, 2x40G, 2x25G or 2x10G, which I intend to find out later). According
to the official spec sheet, the switch allows 2-way breakout ports as well as converter modules, to
insert for example a 25G SFP28 into a QSFP28 switchport.

Honestly, I did not think I would get this far, so I humorously (at least, I think so) decide to
call this switch [[FAFO](https://www.urbandictionary.com/define.php?term=FAFO)].

First off, the `mlxsw` driver loaded:
```
root@fafo:~# lsmod | grep mlx
mlxsw_spectrum        708608  0
mlxsw_pci              36864  1 mlxsw_spectrum
mlxsw_core            217088  2 mlxsw_pci,mlxsw_spectrum
mlxfw                  36864  1 mlxsw_core
vxlan                 106496  1 mlxsw_spectrum
ip6_tunnel             45056  1 mlxsw_spectrum
objagg                 53248  1 mlxsw_spectrum
psample                20480  1 mlxsw_spectrum
parman                 16384  1 mlxsw_spectrum
bridge                311296  1 mlxsw_spectrum
```

I run `sensors-detect` and `pwmconfig`, let the fans calibrate and write their config file. The fans
come back down to a more chill (pun intended) speed, and I take a closer look. It seems all fans and
all thermometers, including the ones in the QSFP28 cages and the _Spectrum_ switch ASIC, are
accounted for:

```
root@fafo:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +30.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +29.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +30.0°C (high = +87.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +106.0°C)
temp2: +29.8°C (crit = +106.0°C)

mlxsw-pci-0300
Adapter: PCI adapter
fan1: 6239 RPM
fan2: 5378 RPM
fan3: 6268 RPM
fan4: 5378 RPM
fan5: 6326 RPM
fan6: 5442 RPM
fan7: 6268 RPM
fan8: 5315 RPM
temp1: +37.0°C (highest = +41.0°C)
front panel 001: +23.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 002: +24.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 003: +23.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 004: +26.0°C (crit = +73.0°C, emerg = +75.0°C)
...
```

From the top, first I see the classic CPU core temps, then an `ACPI` interface which I'm not quite
sure I understand the purpose of (possibly the motherboard, but not the PSU, because pulling one out
does not change any values). Finally, the sensors using driver `mlxsw-pci-0300` are those on the
switch PCB carrying the _Spectrum_ silicon, and there's a thermometer for each of the QSFP28 cages,
possibly reading from the optic, as most of them are empty except the first four, into which I
inserted optics. Slick!

{{< image width="500px" float="right" src="/assets/mlxsw/debian-ethernet.png" alt="Ethernet" >}}

I notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the
Debian machine. But then, the rest of the ports belong to the Mellanox Spectrum ASIC:
* eth2-17 correspond to ports 17-32, which seems normal, but
* eth18-19 correspond to ports 15-16
* eth20-21 correspond to ports 13-14
* eth30-31 correspond to ports 3-4
* eth32-33 correspond to ports 1-2

The switchports are actually sequentially numbered with respect to MAC addresses, with `eth2`
starting at `04:3f:72:74:a9:41` and finally `eth33` having `04:3f:72:74:a9:7f` (for 64 consecutive
MACs).

Somehow though, the ports are wired in a different way on the front panel. As it turns out, I can
insert a little `udev` ruleset that will take care of this:

```
root@fafo:~# cat << 'EOF' > /etc/udev/rules.d/10-local.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
  NAME="sw$attr{phys_port_name}"
EOF
```

After rebooting the switch, the ports are now called `swp1` .. `swp32` and they also correspond with
their physical ports on the front panel. One way to check this is using `ethtool --identify swp1`,
which will blink the LED of port 1 until I press ^C. Nice.

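To double check the mapping for all 32 ports in one go, a small loop like the following works too;
this is a sketch, relying on `ethtool -p` accepting an optional duration in seconds, and it blinks
each front panel LED for two seconds in order:

```
root@fafo:~# for i in $(seq 1 32); do echo swp$i; ethtool -p swp$i 2; done
```
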
### Debian SN2700: Diagnostics

The first thing I'm curious to try is if _Link Layer Discovery Protocol_
[[LLDP](https://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol)] works. This is a
vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet.
I install an open source LLDP daemon and plug in a DAC from port 1 to a Centec switch in the lab.

And indeed, quickly after that, I see two devices: the first on the Linux machine's `eth0`, which is
the Unifi switch that serves my LAN, and the second is the Centec behind `swp1`:

```
root@fafo:~# apt-get install lldpd
root@fafo:~# lldpcli show nei summary
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eth0, via: LLDP
  Chassis:
    ChassisID: mac 44:d9:e7:05:ff:46
    SysName: usw6-BasementServerroom
  Port:
    PortID: local Port 9
    PortDescr: fafo.lab
    TTL: 120
Interface: swp1, via: LLDP
  Chassis:
    ChassisID: mac 60:76:23:00:01:ea
    SysName: sw3.lab
  Port:
    PortID: ifname eth-0-25
    PortDescr: eth-0-25
    TTL: 120
```

With this I learn that the switch forwards these datagrams (ethernet type `0x88CC`) from the
dataplane to the Linux controlplane. I would call this _punting_ in VPP language, but switchdev
calls it _trapping_, and I can see the LLDP packets when tcpdumping on ethernet device `swp1`.
So today I learned how to trap packets :-)

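For example, a quick way to watch the trapped LLDP frames arrive in the kernel is to filter on their
ethertype; a small sketch:

```
root@fafo:~# tcpdump -eni swp1 ether proto 0x88cc
```
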
### Debian SN2700: ethtool

One popular diagnostics tool that is useful (and hopefully well known, because it's awesome) is
`ethtool`, a command-line tool in Linux for managing network interface devices. It allows me to
modify the parameters of the ports and their transceivers, as well as query the information of those
devices.

Here are a few common examples, all of which work on this switch running Debian:

* `ethtool swp1`: Shows link capabilities (eg, 1G/10G/25G/40G/100G)
* `ethtool -s swp1 speed 40000 duplex full autoneg off`: Force speed/duplex
* `ethtool -m swp1`: Shows transceiver diagnostics like SFP+ light levels, link levels
  (also `--module-info`)
* `ethtool -p swp1`: Flashes the transceiver port LED (also `--identify`)
* `ethtool -S swp1`: Shows packet and octet counters, and sizes, discards, errors, and so on
  (also `--statistics`)

I specifically love the _digital diagnostics monitoring_ (DDM), originally specified in
[[SFF-8472](https://members.snia.org/document/dl/25916)], which allows me to read the EEPROM of
optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!

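As a small example, a loop like this prints the vendor information and optical power readings for
the first four QSFP28 cages (a sketch: only those cages have optics in them here, and the exact
field names in the `ethtool -m` output depend on the module type):

```
root@fafo:~# for i in $(seq 1 4); do echo "=== swp$i"; \
  ethtool -m swp$i | grep -Ei 'vendor|power'; done
```
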
### Debian SN2700: devlink

In reading up on the _switchdev_ ecosystem, I stumbled across `devlink`, an API to expose device
information and resources not directly related to any device class, such as switch ASIC
configuration. As a fun fact, devlink was written by the same engineer who wrote the `mlxsw` driver
for Linux, Jiří Pírko. Its documentation can be found in the [[linux
kernel](https://docs.kernel.org/networking/devlink/index.html)], and it ships with any modern
`iproute2` distribution. The specific (somewhat terse) documentation of the `mlxsw` driver
[[lives there](https://docs.kernel.org/networking/devlink/mlxsw.html)] as well.

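To get oriented, the top-level devlink objects can be listed first; a small sketch (on this machine
the switch ASIC shows up as `pci/0000:03:00.0`, and `devlink dev info` reports, among other things,
driver and firmware versions):

```
root@fafo:~# devlink dev show
root@fafo:~# devlink dev info pci/0000:03:00.0
root@fafo:~# devlink port show
```
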
There's a lot to explore here, but I'll focus my attention on three things:

#### 1. devlink resource

When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many
prefixes can be offloaded to the ASIC? One way to find out is to query what types of _resources_ it
has:

```
root@fafo:~# devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
  name kvd size 258048 unit entry dpipe_tables none
    resources:
      name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none
        resources:
          name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
          name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
          name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
      name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none
      name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none
  name span_agents size 3 occ 0 unit entry dpipe_tables none
  name counters size 32000 occ 4 unit entry dpipe_tables none
    resources:
      name rif size 8192 occ 0 unit entry dpipe_tables none
      name flow size 23808 occ 4 unit entry dpipe_tables none
  name global_policers size 1000 unit entry dpipe_tables none
    resources:
      name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
  name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
  name rifs size 1000 occ 1 unit entry dpipe_tables none
  name physical_ports size 64 occ 36 unit entry dpipe_tables none
```

There's a lot to unpack here, but this is a tree of resources, each with names and children. Let me
focus on the first one, called `kvd`, which stands for _Key Value Database_ (in other words, a set
of lookup tables). It contains a bunch of children called `linear`, `hash_double` and `hash_single`.
The kernel [[docs](https://www.kernel.org/doc/Documentation/ABI/testing/devlink-resource-mlxsw)]
explain it in more detail, but this is where the switch will keep its FIB in _Content Addressable
Memory_ (CAM) of certain types of elements of a given length and count. All up, the size is 252K
entries, which is not huge, but also certainly not tiny!

Here I learn that it's subdivided into:
* ***linear***: 96K entries of flat memory using an index, further divided into regions:
  * ***singles***: 16K entries of size 1, nexthops
  * ***chunks***: 48K entries of size 32, multipath routes with <32 entries
  * ***large_chunks***: 32K entries of size 512, multipath routes with <512 entries
* ***hash_single***: 92K entries of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and
  neighbors)
* ***hash_double***: 63K entries of hash table for keys larger than 64 bits (eg. IPv6 FIB and
  neighbors)

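According to the devlink resource documentation for `mlxsw`, these KVD partitions aren't set in
stone either: they can be resized (respecting the `size_min`/`size_max`/`size_gran` constraints
shown above, and keeping the total consistent), after which a devlink reload applies the new layout.
A sketch, with illustrative sizes that would trade IPv6 space for IPv4/L2 space:

```
root@fafo:~# devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 126976
root@fafo:~# devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 32768
root@fafo:~# devlink dev reload pci/0000:03:00.0
```
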
#### 2. devlink dpipe

Now that I know the memory layout and regions of the CAM, I can start making some guesses on the FIB
size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the
ASIC's pipeline in a generic way. The API is described in detail in the [[kernel
docs](https://docs.kernel.org/networking/devlink/devlink-dpipe.html)]. I feel free to take a peek at
the dataplane configuration innards:

```
root@fafo:~# devlink dpipe table show pci/0000:03:00.0
pci/0000:03:00.0:
  name mlxsw_erif size 1000 counters_enabled false
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
    action:
      type field_modify header mlxsw_meta field l3_forward
      type field_modify header mlxsw_meta field l3_drop
  name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv4 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv6 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1
    match:
      type field_exact header mlxsw_meta field adj_index
      type field_exact header mlxsw_meta field adj_size
      type field_exact header mlxsw_meta field adj_hash_index
    action:
      type field_modify header ethernet field destination mac
      type field_modify header mlxsw_meta field erif_port mapping ifindex
```

From this I can puzzle together how the CAM is _actually_ used:

* ***mlxsw_host4***: matches on the interface port and IPv4 destination IP, using `hash_single`
  above with one unit for each entry, and when looking that up, puts the result into the ethernet
  destination MAC (in other words, the FIB entry points at an L2 nexthop!)
* ***mlxsw_host6***: matches on the interface port and IPv6 destination IP using `hash_double`
  with two units for each entry.
* ***mlxsw_adj***: holds the L2 adjacencies, and the lookup key is an index, size and hash index,
  where the returned value is used to rewrite the destination MAC and select the egress port!

Now that I know the types of tables and what they are matching on (and then which action they are
performing), I can also take a look at the _actual data_ in the FIB. For example, if I create an IPv4
interface on the switch and ping a host on the directly connected network there, I can see an entry
show up in the L2 adjacency table, like so:

```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ping 100.65.1.2
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4
pci/0000:03:00.0:
  index 0
    match_value:
      type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1
      type field_exact header ipv4 field destination ip value 100.65.1.2
    action_value:
      type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10
```

To decipher what the switch is doing: if the ifindex is `71` (which corresponds to `swp31`), and the
IPv4 destination IP address is `100.65.1.2`, then the destination MAC address will be set to
`b4:96:91:b3:b1:10`, so the switch knows where to send this ethernet datagram.

And now I have found what I need to know to be able to answer the question of the FIB size. This
switch can take **92K IPv4 routes** and **31.5K IPv6 routes**, and I can even inspect the FIB in great
detail. Rock on!

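The dpipe tables can also keep per-entry hit counters, which makes dumps like the one above even
more useful when chasing down where traffic is actually being forwarded. A sketch, using the
documented `devlink dpipe table set` syntax:

```
root@fafo:~# devlink dpipe table set pci/0000:03:00.0 name mlxsw_host4 counters enable
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4
```
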
#### 3. devlink port split

But reading the switch chip configuration and FIB is not all that `devlink` can do: it can also make
changes! One particularly interesting one is the ability to split and unsplit ports. What this means
is that a 100Gbit port is internally divided into four so-called _lanes_ of 25Gbit each, while a
40Gbit port is internally divided into four _lanes_ of 10Gbit each. Splitting ports is the act of
taking such a port and reconfiguring its lanes.

Let me show you, by means of example, what splitting the first two switchports might look like. They
begin their life as 100G ports, which support a number of link speeds, notably: 100G, 50G, 25G, but
also 40G, 10G, and finally 1G:

```
root@fafo:~# ethtool swp1
Settings for swp1:
  Supported ports: [ FIBRE ]
  Supported link modes:  1000baseKX/Full
                         10000baseKR/Full
                         40000baseCR4/Full
                         40000baseSR4/Full
                         40000baseLR4/Full
                         25000baseCR/Full
                         25000baseSR/Full
                         50000baseCR2/Full
                         100000baseSR4/Full
                         100000baseCR4/Full
                         100000baseLR4_ER4/Full

root@fafo:~# devlink port show | grep 'swp[12] '
pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4
pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4
root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4
[ 629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down
[ 629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down
[ 630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726)
[ 630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2
[ 630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2

root@fafo:~# ethtool swp1s0
Settings for swp1s0:
  Supported ports: [ FIBRE ]
  Supported link modes:  1000baseKX/Full
                         10000baseKR/Full
                         25000baseCR/Full
                         25000baseSR/Full
```

Whoa, what just happened here? The switch took the port defined by `pci/0000:03:00.0/61`, which says
it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
and the resulting ports are 25G, 10G or 1G.

{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

However, I make an important observation. When splitting `swp1` in 4, the switch also removed port
`swp2`, and remember at the beginning of this article I mentioned that the MAC addresses seemed to
skip one entry between subsequent interfaces? Now I understand why: when splitting the port into two,
it will use the second MAC address for the second 50G port; but if I split it into four, it'll use
the MAC addresses from the adjacent port and decommission it. In other words: this switch can do
32x100G, or 64x50G, or 64x25G/10G/1G.

It doesn't matter which of the PCI interfaces I split on. The operation is also reversible: I can
issue `devlink port unsplit` to return the port to its aggregate state (eg. 4 lanes and 100Gbit),
which will remove the `swp1s0-3` ports and put back `swp1` and `swp2` again.

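For completeness, that looks like the sketch below, reusing the port handle from the split above
(`devlink port show` lists the current handles, in case they changed after the split):

```
root@fafo:~# devlink port unsplit pci/0000:03:00.0/61
```
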
What I find particularly impressive about this is that for most hardware vendors, this splitting of
ports requires a reboot of the chassis, while here it can happen entirely online. Well done,
Mellanox!

## Performance

OK, so this all seems to work, but does it work _well_? If you're a reader of my blog you'll know
that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the
switch on ports 31 and 32:

```
[ 1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[ 1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
            Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
[ 7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
            Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
```

I hope you're hungry, Hippo, cuz you're about to get fed!

### Debian SN2700: L2

To use the switch in L2 mode, I intuitively create a Linux bridge, say `br0`, and add ports to that.
From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one
another, but there can only be _one_ such bridge with `vlan_filtering` set. VLAN Filtering allows
the switch to only accept tagged frames from a list of _configured_ VLANs, and drop the rest. This
is what you'd imagine a regular commercial switch would provide.

So off I go, creating the bridge to which I'll add two ports (HundredGigabitEthernet ports swp31 and
swp32), and I will allow for the maximum MTU size of 9216, also known as [[Jumbo Frames](https://en.wikipedia.org/wiki/Jumbo_frame)].

```
root@fafo:~# ip link add name br0 type bridge
root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up
root@fafo:~# ip link set swp31 mtu 9216 master br0 up
root@fafo:~# ip link set swp32 mtu 9216 master br0 up
```

These two ports are now _access_ ports, that is to say they accept and emit only untagged traffic,
and due to the `vlan_filtering` flag, they will drop all other frames. Using the standard `bridge`
utility from Linux, I can manipulate the VLANs on these ports.

First, I'll remove the _default VLAN_ and add VLAN 1234 to both ports, specifying that VLAN 1234 is
the so-called _Port VLAN ID_ (pvid). This makes them the equivalent of Cisco's _switchport access
1234_:

```
root@fafo:~# bridge vlan del vid 1 dev swp1
root@fafo:~# bridge vlan del vid 1 dev swp2
root@fafo:~# bridge vlan add vid 1234 dev swp1 pvid
root@fafo:~# bridge vlan add vid 1234 dev swp2 pvid
```
Then, I'll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a
_trunk_ port that allows these tagged VLANs, while untagged traffic is still assigned to VLAN 1234:

```
root@fafo:~# for port in swp1 swp2; do for vlan in 100 200 300 400; do \
  bridge vlan add vid $vlan dev $port; done; done
root@fafo:~# bridge vlan
port    vlan-id
swp1    100 200 300 400
        1234 PVID
swp2    100 200 300 400
        1234 PVID
br0     1 PVID Egress Untagged
```

When these commands are run against the `swp*` interfaces, they are picked up by the `mlxsw` kernel
driver and transmitted to the _Spectrum_ switch chip; in other words, these commands end up
programming the silicon. Traffic through the switch ports on the front rarely (if ever) gets
forwarded to the Linux kernel: very similar to [[VPP](https://fd.io/)], the traffic stays mostly in
the dataplane. Some traffic, such as LLDP (and, as we'll see later, IPv4 ARP and IPv6 neighbor
discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the
results are transmitted back via PCIe to program the switch chip's L2/L3 _Forwarding Information
Base_ (FIB).

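A nice way to convince yourself that the silicon, not the kernel, is doing the work is to look for
the `offload` markers that switchdev sets on forwarding entries it has programmed into the ASIC,
both on bridge FDB entries and (as we'll see in the L3 tests below) on routes and neighbors; a small
sketch:

```
root@fafo:~# bridge -d fdb show br br0 | grep offload
root@fafo:~# ip -4 route show | grep offload
```
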
Now I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a
bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine:

{{< image src="/assets/mlxsw/trex1.png" alt="Trex L2" >}}

At this point I can already conclude that this is all happening in the dataplane. The _Spectrum_
switch is connected to the Debian machine using a PCIe v3.0 x8 link, which is even obscured by
another device on the PCIe bus, so the Debian kernel is in no way able to process more than a token
amount of traffic; and yet I'm seeing 100Gbit go through the switch chip while the CPU load on the
kernel stays pretty much zero. I can however retrieve the link statistics using `ip stats`, and those
will show me the actual counters of the silicon, not _just_ the trapped packets. If you'll recall, in
VPP the only packets that the TAP interfaces see are those packets that are _punted_, and the Linux
kernel there is completely oblivious to the total dataplane throughput. Here, the interface is
showing the correct dataplane packet and byte counters, which means that things like SNMP will
automatically just do the right thing.

```
root@fafo:~# dmesg | grep 03:00.*bandwidth
[ 2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s
            PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)

root@fafo:~# uptime
 03:19:16 up 2 days, 14:14,  1 user,  load average: 0.00, 0.00, 0.00

root@fafo:~# ip stats show dev swp32 group link
72: swp32: group link
    RX:  bytes           packets       errors dropped missed mcast
    5106713943502        15175926564   0      0       0      103
    TX:  bytes           packets       errors dropped carrier collsns
    23464859508367       103495791750  0      0       0      0
```

### Debian SN2700: IPv4 and IPv6

I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge
`br0`, which will return the enslaved switchports. I then convert the T-Rex loadtester to use an L3
profile, and configure the switch as follows:

```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31

root@fafo:~# ip addr add 100.65.2.1/30 dev swp32
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32
```
Several other routers I've loadtested have the same (cosmetic) issue that T-Rex doesn't reply to
ARP packets after the first few seconds. So I first set the IPv4 address, then add a static L2
adjacency for the T-Rex side (on MAC `b4:96:91:b3:b1:10`), and route 16.0.0.0/8 to port 0 and
48.0.0.0/8 to port 1 of the loadtester.

{{< image src="/assets/mlxsw/trex2.png" alt="Trex L3" >}}

I start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up
just fine. Taking a closer look at the `ip stats` instrumentation, I see that there's the ability to
turn on L3 counters in addition to L2 (ethernet) counters. So I do that on my two router ports while
they are happily forwarding 58.9Mpps, and I can now see the difference between dataplane traffic
(forwarded in hardware) and CPU traffic (handled by the kernel):

```
root@fafo:~# ip stats set dev swp31 l3_stats on
root@fafo:~# ip stats set dev swp32 l3_stats on
root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats
72: swp32: group offload subgroup l3_stats on used on
    RX:  bytes           packets        errors dropped mcast
    270222574848200      1137559577576  0      0       0
    TX:  bytes           packets        errors dropped
    281073635911430      1196677185749  0      0
root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit
72: swp32: group offload subgroup cpu_hit
    RX:  bytes           packets        errors dropped missed mcast
    1068742              17810          0      0       0      0
    TX:  bytes           packets        errors dropped carrier collsns
    468546               2191           0      0       0      0
```

The statistics above clearly demonstrate that the lion's share of the packets have been forwarded by
the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course
any traffic _to_ the IP addresses configured on the router) will go to the kernel.

#### Debian SN2700: BVI (or VLAN Interfaces)

I've played around a little bit with L2 (switch) and L3 (router) ports, but there is one middle
ground. I'll keep the T-Rex loadtest running in L3 mode, but now I'll reconfigure the switch to put
the ports back into the bridge, each port in its own VLAN, and have a so-called _Bridge Virtual
Interface_ (BVI), also known as a VLAN interface -- this is where the switch has a bunch of ports
together in a VLAN, but the switch itself has an IPv4 or IPv6 address in that VLAN as well, which
can act as a router.

I reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move
the IPv4 addresses and routes there -- so here I go, first putting the switch interfaces back into
L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports:

```
root@fafo:~# ip link add name br0 type bridge vlan_filtering 1
root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up
root@fafo:~# ip link set swp31 master br0 mtu 9216 up
root@fafo:~# ip link set swp32 master br0 mtu 9216 up
root@fafo:~# bridge vlan del vid 1 dev swp31
root@fafo:~# bridge vlan del vid 1 dev swp32
root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid
root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid
```

From the ASIC specs, I understand that these BVIs need to (re)use a MAC from one of the members, so
the first thing I do is give `br0` the right MAC address. Then I put the switch ports into the bridge,
remove VLAN 1 and put them in their respective VLANs. At this point, the loadtester reports 100%
packet loss, because the two ports can no longer see each other at layer 2, and the layer 3 configs
have been removed. But I can restore connectivity with two _BVIs_ as follows:

```
root@fafo:~# for vlan in 1000 2000; do
  ip link add link br0 name br0.$vlan type vlan id $vlan
  bridge vlan add dev br0 vid $vlan self
  ip link set br0.$vlan up mtu 9216
done

root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000

root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000
```

And with that, the loadtest shoots back into action:

{{< image src="/assets/mlxsw/trex3.png" alt="Trex L3 BVI" >}}

First a quick overview of the situation I have created:

```
root@fafo:~# bridge vlan
port    vlan-id
swp31   1000 PVID
swp32   2000 PVID
br0     1 PVID Egress Untagged

root@fafo:~# ip -4 ro
default via 198.19.5.1 dev eth0 onlink rt_trap
16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload
48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload
100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload
100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload
198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap

root@fafo:~# ip -4 nei
198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE
100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT
100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT
```

Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important
difference. Now, the switch can have any number of ports in VLAN 1000, which will all amongst
themselves do L2 forwarding at line rate, and when they need to send IPv4 traffic out, they will ARP
for the gateway (for example at `100.65.1.1/24`), which will get _trapped_ and forwarded to the CPU,
after which the ARP reply will go out so that the machines know where to find the gateway. From that
point on, IPv4 forwarding happens once again in hardware, which can be shown by the keyword
`rt_offload` in the routing table (`br0`, in the ASIC), compared to `rt_trap` (`eth0`, in the
kernel). Similarly for the IPv4 neighbors, the L2 adjacency is programmed into the CAM (the output
of which I took a look at above), so forwarding can be done directly by the ASIC without
intervention from the CPU.

As a result, these _VLAN Interfaces_ (which are synonymous with _BVIs_) work at line rate out of
the box.

### Results

This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their
`mlxsw` switchdev implementation. I have in my hands a very affordable switch with 32x100G or
64x(50G, 25G, 10G, 1G) ports and anything in between, with IPv4 and IPv6 forwarding in hardware,
with a limited FIB size, not too dissimilar from the [[Centec]({{< ref "2022-12-09-oem-switch-2" >}})]
switches that IPng Networks runs in its AS8298 network, albeit without MPLS forwarding capabilities.

Still, as a LAB switch to better test 25G and 100G topologies, this switch is very good value for
the money spent, and it helps that it runs Debian and is fully configurable with things like Kees
and Ansible. Considering there's a whole range of 48x10G and 48x25G switches as well from Mellanox,
all completely open and _officially allowed_ to run OSS stuff on, these make a perfect fit for IPng
Networks!

### Acknowledgements

This article was written after fussing around and finding out, but a few references were particularly
helpful, and I'd like to acknowledge the following super useful sites:

* [[mlxsw wiki](https://github.com/Mellanox/mlxsw/wiki)] on GitHub
* [[jpirko's kernel driver](https://github.com/jpirko/linux_mlxsw)] on GitHub
* [[SONiC wiki](https://github.com/sonic-net/SONiC/wiki/)] on GitHub
* [[Spectrum Docs](https://www.nvidia.com/en-us/networking/ethernet-switching/spectrum-sn2000/)] on NVIDIA

And to the community for writing and maintaining this excellent switchdev implementation.