---
date: "2023-11-11T13:31:00Z"
title: Debian on Mellanox SN2700 (32x100G)
aliases:
- /s/articles/2023/11/11/mellanox-sn2700.html
---

# Introduction

I'm still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it's a bit more convenient to connect them all to a switch. However, for this to work I would need (at least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real quick. Or do they?

## Hardware

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-1.png" alt="SN2700" >}}

I thought I'd ask the #nlnog IRC channel for advice, and of course the usual suspects came past, such as Juniper, Arista, and Cisco. But somebody mentioned "How about Mellanox, like SN2700?" and I remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and I found one for EUR 1'400,- for 32x100G, which felt suspiciously low priced... but I thought YOLO and I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.

The switch specs are pretty impressive: 32x100G QSFP28 ports, which can be broken out into sets of sub-ports (each of 1/10/25/50G), with a specified switch throughput of 6.4Tbps and 4.76Gpps, while only consuming ~150W all-up. Further digging revealed that the architecture of this switch consists of two main parts:

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-2.png" alt="SN2700" >}}

1. an AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the front of the chassis; furthermore, it's equipped with 8GB of RAM in an SO-DIMM, and its CPU is a two-core Celeron(R) CPU 1047UE @ 1.40GHz.
1. the silicon used in this switch is called _Spectrum_ and identifies itself in Linux as PCI device `03:00.0` called _Mellanox Technologies MT52100_, so the front dataplane with 32x100G is separated from the Linux based controlplane.

{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-3.png" alt="SN2700" >}}

When turning on the device, the serial port comes to life and shows me a BIOS, quickly after which it jumps into GRUB2 and wants me to install an operating system using something called ONIE. I've heard of that, but now it's time for me to learn a little bit more about that stuff. I ask around and there are plenty of ONIE images for this particular type of chip to be found - some are open source, some are semi-open source (as in: were once available but are now behind paywalls, etc).

Before messing around with the switch and possibly locking myself out or bricking it, I take out the 16GB mSATA and make a copy of it for safekeeping. I feel somewhat invincible by doing this. How bad could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I'm about to find out, so read on!

## Software

The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and the capabilities of the Mellanox Spectrum® ASIC.
The SN2700 has three modes of operation:

* Preinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system utilizing common networking user experiences and an industry standard CLI.
* Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience from servers to switches and providing rich routing functionality for large scale applications.
* Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based operating systems.

I asked around a bit more and found that there are a few more things one might do with this switch. One of them is [[SONiC](https://github.com/sonic-net/SONiC)], which stands for _Software for Open Networking in the Cloud_, and has support for the _Spectrum_ and notably the _SN2700_ switch. Cool! I also learned about [[DENT](https://dent.dev)], which utilizes the Linux Kernel, Switchdev, and other Linux based projects as the basis for building a new standardized network operating system without abstractions or overhead. Unfortunately, while the _Spectrum_ chipset is known to DENT, this particular layout on the SN2700 is not supported.

Finally, my buddy `fall0ut` said "why not just Debian with switchdev?" and now my eyes opened wide. I had not yet come across [[switchdev](https://docs.kernel.org/networking/switchdev.html)], a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation in the [[linux kernel](https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/mellanox/mlxsw)] for the Spectrum series of silicon, and it's all upstreamed to the Linux kernel. Wait, what?!

### Mellanox Switchdev

I start by reading the [[brochure](/assets/mlxsw/PB_Spectrum_Linux_Switch.pdf)], which shows me the intentions Mellanox had when designing and marketing these switches. It seems that they really meant it when they said this thing is a fully customizable Linux switch, check out this paragraph:

> Once the Mellanox Switchdev driver is loaded into the Linux Kernel, each
> of the switch’s physical ports is registered as a net_device within the
> kernel. Using standard Linux tools (for example, bridge, tc, iproute), ports
> can be bridged, bonded, tunneled, divided into VLANs, configured for L3
> routing and more. Linux switching and routing tables are reflected in the
> switch hardware. Network traffic is then handled directly by the switch.
> Standard Linux networking applications can be natively deployed and
> run on switchdev. This may include open source routing protocol stacks,
> such as Quagga, Bird and XORP, OpenFlow applications, or user-specific
> implementations.

### Installing Debian on SN2700

... they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I had laying around. After installing it, I noticed that the coveted `mlxsw` driver is not shipped by default with the Linux kernel image in Debian, so I decide to build my own, letting the [[Debian docs](https://wiki.debian.org/BuildADebianKernelPackage)] take my hand and guide me through it.
I find a reference on the Mellanox [[GitHub wiki](https://github.com/Mellanox/mlxsw/wiki/Installing-a-New-Kernel)] which shows me which kernel modules to include to successfully use the _Spectrum_ under Linux, so I think I know what to do:

```
pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \
  libncurses5-dev libelf-dev libssl-dev dwarves bison
pim@summer:/usr/src$ sudo apt install linux-source-6.1
pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz
pim@summer:/usr/src$ cd linux-source-6.1/
pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config
pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_IPV6_GRE=m
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_DEVLINK=y
CONFIG_MLXFW=m
CONFIG_MLXSW_CORE=m
CONFIG_MLXSW_CORE_HWMON=y
CONFIG_MLXSW_CORE_THERMAL=y
CONFIG_MLXSW_PCI=m
CONFIG_MLXSW_I2C=m
CONFIG_MLXSW_MINIMAL=y
CONFIG_MLXSW_SWITCHX2=m
CONFIG_MLXSW_SPECTRUM=m
CONFIG_MLXSW_SPECTRUM_DCB=y
CONFIG_LEDS_MLXCPLD=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_CLS_FLOWER=m
CONFIG_NET_ACT_GACT=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_SAMPLE=m
CONFIG_NET_ACT_VLAN=m
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_NET_VRF=m
EOF
pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig
pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg
```

I run a gratuitous `make menuconfig` after adding all those config statements to the end of the `.config` file, and it figures out how to combine what I just added with what was in the file already. I started from the standard Bookworm 6.1 kernel config that came with the default installer, so that the result would be a minimal diff to what Debian itself ships. After Summer stretches her legs a bit compiling this kernel for me, look at the result:

```
pim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw
drwxr-xr-x root/root 0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/
-rw-r--r-- root/root 414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko
-rw-r--r-- root/root 19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko
-rw-r--r-- root/root 31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko
-rw-r--r-- root/root 65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko
-rw-r--r-- root/root 1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko
```

Good job, Summer!
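Getting this kernel onto the target is then just a matter of installing the freshly built package with `dpkg` and letting GRUB pick it up. A minimal sketch, assuming the package name from the build above and that it is run on the Debian install that lives on the mSATA disk destined for the switch:

```
pim@fafo:~$ sudo dpkg -i linux-image-6.1.55_6.1.55-4_amd64.deb
pim@fafo:~$ sudo update-grub
pim@fafo:~$ grep MLXSW_SPECTRUM /boot/config-6.1.55
```

The kernel package's postinst normally regenerates the GRUB menu on its own; running `update-grub` by hand is just belt-and-braces, and the `grep` against the config the package drops in `/boot` confirms the _Spectrum_ driver really made it in.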
On my mSATA disk, I tell Linux to boot its kernel using the following in GRUB, which will make the kernel not create spiffy interface names like `enp6s0` or `eno1`, but just enumerate them all one by one and call them `eth0` and so on:

```
pim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0"
```

## Mellanox SN2700 running Debian+Switchdev

{{< image width="400px" float="right" src="/assets/mlxsw/debian.png" alt="Debian" >}}

I insert the freshly installed Debian Bookworm with the custom compiled 6.1.55+mlxsw kernel into the switch, and it boots on the first try. I see 34 (!) ethernet ports: the first two come from an Intel NIC but carry a MAC address from Mellanox (starting with `0c:42:a1`), and the other 32 have a common MAC address (from Mellanox, starting with `04:3f:72`). I also notice that the MAC addresses skip one between subsequent ports, which leads me to believe that these 100G ports can be split in two (perhaps 2x50G, 2x40G, 2x25G, 2x10G, which I intend to find out later). According to the official spec sheet, the switch allows 2-way breakout ports as well as converter modules, to insert for example a 25G SFP28 into a QSFP28 switchport.

Honestly, I did not think I would get this far, so I humorously (at least, I think so) decide to call this switch [[FAFO](https://www.urbandictionary.com/define.php?term=FAFO)]. First off, the `mlxsw` driver loaded:

```
root@fafo:~# lsmod | grep mlx
mlxsw_spectrum       708608  0
mlxsw_pci             36864  1 mlxsw_spectrum
mlxsw_core           217088  2 mlxsw_pci,mlxsw_spectrum
mlxfw                 36864  1 mlxsw_core
vxlan                106496  1 mlxsw_spectrum
ip6_tunnel            45056  1 mlxsw_spectrum
objagg                53248  1 mlxsw_spectrum
psample               20480  1 mlxsw_spectrum
parman                16384  1 mlxsw_spectrum
bridge               311296  1 mlxsw_spectrum
```

I run `sensors-detect` and `pwmconfig`, let the fans calibrate, and write their config file. The fans come back down to a more chill (pun intended) speed, and I take a closer look. It seems all fans and all thermometers, including the ones in the QSFP28 cages and the _Spectrum_ switch ASIC, are accounted for:

```
root@fafo:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +30.0°C  (high = +87.0°C, crit = +105.0°C)
Core 0:        +29.0°C  (high = +87.0°C, crit = +105.0°C)
Core 1:        +30.0°C  (high = +87.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:         +27.8°C  (crit = +106.0°C)
temp2:         +29.8°C  (crit = +106.0°C)

mlxsw-pci-0300
Adapter: PCI adapter
fan1:         6239 RPM
fan2:         5378 RPM
fan3:         6268 RPM
fan4:         5378 RPM
fan5:         6326 RPM
fan6:         5442 RPM
fan7:         6268 RPM
fan8:         5315 RPM
temp1:         +37.0°C  (highest = +41.0°C)
front panel 001:  +23.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 002:  +24.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 003:  +23.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 004:  +26.0°C  (crit = +73.0°C, emerg = +75.0°C)
...
```

From the top, first I see the classic CPU core temps, then an `ACPI` interface whose purpose I'm not quite sure I understand (possibly the motherboard, but not the PSU, because pulling one out does not change any values). Finally, the sensors using driver `mlxsw-pci-0300` are those on the switch PCB carrying the _Spectrum_ silicon, and there's a thermometer for each of the QSFP28 cages, possibly reading from the optic, as most of them are empty except the first four, into which I inserted optics. Slick!
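For scripting and monitoring, the same sensors are also exposed through the kernel's hwmon class in sysfs, which is what `lm-sensors` reads under the hood. A quick sketch, assuming the standard hwmon layout (device names and the number of `temp*_input` attributes will vary per platform):

```
root@fafo:~# for hw in /sys/class/hwmon/hwmon*; do
  # every hwmon device exposes a 'name'; temperatures are reported in millidegrees Celsius
  echo "$(cat $hw/name): $(cat $hw/temp1_input 2>/dev/null || echo n/a)"
done
```

This makes it easy to feed the ASIC and QSFP28 cage temperatures into whatever telemetry pipeline happens to be around, without shelling out to `sensors` and parsing its output.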
{{< image width="500px" float="right" src="/assets/mlxsw/debian-ethernet.png" alt="Ethernet" >}}

I notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the Debian machine. But then, the rest of the ports are the Mellanox Spectrum ASIC:

* eth2-17 correspond to port 17-32, which seems normal, but
* eth18-19 correspond to port 15-16
* eth20-21 correspond to port 13-14
* eth30-31 correspond to port 3-4
* eth32-33 correspond to port 1-2

The switchports are actually sequentially numbered with respect to MAC addresses, with `eth2` starting at `04:3f:72:74:a9:41` and finally `eth33` having `04:3f:72:74:a9:7f` (for 64 consecutive MACs). Somehow, though, the ports are wired in a different way on the front panel. As it turns out, I can insert a little `udev` ruleset that will take care of this:

```
root@fafo:~# cat << EOF > /etc/udev/rules.d/10-local.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
  NAME="sw$attr{phys_port_name}"
EOF
```

After rebooting the switch, the ports are now called `swp1` .. `swp32` and they also correspond with their physical ports on the front panel. One way to check this is by using `ethtool --identify swp1`, which will blink the LED of port 1 until I press ^C. Nice.

### Debian SN2700: Diagnostics

The first thing I'm curious to try is whether _Link Layer Discovery Protocol_ [[LLDP](https://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol)] works. This is a vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet. I install an open source LLDP daemon and plug in a DAC from port 1 to a Centec switch in the lab. And indeed, quickly after that, I see two devices: the first on the Linux machine's `eth0`, which is the Unifi switch that has my LAN, and the second is the Centec behind `swp1`:

```
root@fafo:~# apt-get install lldpd
root@fafo:~# lldpcli show nei summary
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP
  Chassis:
    ChassisID:    mac 44:d9:e7:05:ff:46
    SysName:      usw6-BasementServerroom
  Port:
    PortID:       local Port 9
    PortDescr:    fafo.lab
    TTL:          120
Interface:    swp1, via: LLDP
  Chassis:
    ChassisID:    mac 60:76:23:00:01:ea
    SysName:      sw3.lab
  Port:
    PortID:       ifname eth-0-25
    PortDescr:    eth-0-25
    TTL:          120
```

With this I learn that the switch forwards these datagrams (ethernet type `0x88CC`) from the dataplane to the Linux controlplane. I would call this _punting_ in VPP language, but switchdev calls it _trapping_, and I can see the LLDP packets when tcpdumping on ethernet device `swp1`. So today I learned how to trap packets :-)

### Debian SN2700: ethtool

One popular diagnostics tool that is useful (and, hopefully, well known because it's awesome) is `ethtool`, a command-line tool in Linux for managing network interface devices. It allows me to modify the parameters of the ports and their transceivers, as well as query the information of those devices.
Here are a few common examples, all of which work on this switch running Debian:

* `ethtool swp1`: Shows link capabilities (eg, 1G/10G/25G/40G/100G)
* `ethtool -s swp1 speed 40000 duplex full autoneg off`: Forces speed/duplex
* `ethtool -m swp1`: Shows transceiver diagnostics like SFP+ light levels, link levels (also `--module-info`)
* `ethtool -p swp1`: Flashes the transceiver port LED (also `--identify`)
* `ethtool -S swp1`: Shows packet and octet counters, and sizes, discards, errors, and so on (also `--statistics`)

I specifically love the _digital diagnostics monitoring_ (DDM), originally specified in [[SFF-8472](https://members.snia.org/document/dl/25916)], which allows me to read the EEPROM of optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!

### Debian SN2700: devlink

In reading up on the _switchdev_ ecosystem, I stumbled across `devlink`, an API to expose device information and resources not directly related to any device class, such as switch ASIC configuration. As a fun fact, devlink was written by the same engineer who wrote the `mlxsw` driver for Linux, Jiří Pírko. Its documentation can be found in the [[linux kernel](https://docs.kernel.org/networking/devlink/index.html)], and it ships with any modern `iproute2` distribution. The specific (somewhat terse) documentation of the `mlxsw` driver [[lives there](https://docs.kernel.org/networking/devlink/mlxsw.html)] as well. There's a lot to explore here, but I'll focus my attention on three things:

#### 1. devlink resource

When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many prefixes can be offloaded to the ASIC? One way to find out is to query what types of _resources_ it has:

```
root@fafo:~# devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
  name kvd size 258048 unit entry dpipe_tables none
    resources:
      name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none
        resources:
          name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
          name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
          name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
      name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none
      name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none
  name span_agents size 3 occ 0 unit entry dpipe_tables none
  name counters size 32000 occ 4 unit entry dpipe_tables none
    resources:
      name rif size 8192 occ 0 unit entry dpipe_tables none
      name flow size 23808 occ 4 unit entry dpipe_tables none
  name global_policers size 1000 unit entry dpipe_tables none
    resources:
      name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
  name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
  name rifs size 1000 occ 1 unit entry dpipe_tables none
  name physical_ports size 64 occ 36 unit entry dpipe_tables none
```

There's a lot to unpack here, but this is a tree of resources, each with names and children. Let me focus on the first one, called `kvd`, which stands for _Key Value Database_ (in other words, a set of lookup tables). It contains a bunch of children called `linear`, `hash_double` and `hash_single`.
The kernel [[docs](https://www.kernel.org/doc/Documentation/ABI/testing/devlink-resource-mlxsw)] explain it in more detail, but this is where the switch will keep its FIB in _Content Addressable Memory_ (CAM), organized as several types of elements, each with a given size and count. All up, the size is 252K entries, which is not huge, but also certainly not tiny! Here I learn that it's subdivided into:

* ***linear***: 96K entries of flat memory addressed by an index, further divided into regions:
  * ***singles***: 16K entries of size 1, nexthops
  * ***chunks***: 48K entries of size 32, multipath routes with <32 entries
  * ***large_chunks***: 32K entries of size 512, multipath routes with <512 entries
* ***hash_single***: 92K entries of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and neighbors)
* ***hash_double***: 63K entries of hash table for keys larger than 64 bits (eg. IPv6 FIB and neighbors)

#### 2. devlink dpipe

Now that I know the memory layout and regions of the CAM, I can start making some guesses on the FIB size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the ASIC's pipeline in a generic way. The API is described in detail in the [[kernel docs](https://docs.kernel.org/networking/devlink/devlink-dpipe.html)]. I feel free to take a peek at the dataplane configuration innards:

```
root@fafo:~# devlink dpipe table show pci/0000:03:00.0
pci/0000:03:00.0:
  name mlxsw_erif size 1000 counters_enabled false
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
    action:
      type field_modify header mlxsw_meta field l3_forward
      type field_modify header mlxsw_meta field l3_drop
  name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv4 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv6 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1
    match:
      type field_exact header mlxsw_meta field adj_index
      type field_exact header mlxsw_meta field adj_size
      type field_exact header mlxsw_meta field adj_hash_index
    action:
      type field_modify header ethernet field destination mac
      type field_modify header mlxsw_meta field erif_port mapping ifindex
```

From this I can puzzle together how the CAM is _actually_ used:

* ***mlxsw_host4***: matches on the interface port and IPv4 destination IP, using `hash_single` above with one unit for each entry, and when looking that up, puts the result into the ethernet destination MAC (in other words, the FIB entry points at an L2 nexthop!)
* ***mlxsw_host6***: matches on the interface port and IPv6 destination IP using `hash_double` with two units for each entry.
* ***mlxsw_adj***: holds the L2 adjacencies, and the lookup key is an index, size and hash index, where the returned value is used to rewrite the destination MAC and select the egress port!

Now that I know the types of tables and what they are matching on (and then which action they are performing), I can also take a look at the _actual data_ in the FIB.
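The table dump shown next works out of the box. If I also want per-entry hit counters (the listing above shows `counters_enabled false` for every table), devlink can switch those on per table. A small sketch, assuming the `mlxsw_host4` table name from the output above:

```
root@fafo:~# devlink dpipe table set pci/0000:03:00.0 name mlxsw_host4 counters enable
root@fafo:~# devlink dpipe table show pci/0000:03:00.0 name mlxsw_host4 | grep counters_enabled
```

With counters enabled, subsequent dumps carry a hit count per entry, which is handy when trying to figure out which FIB entries are actually being used.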
For example, if I create an IPv4 interface on the switch and ping a member of the directly connected network there, I can see an entry show up in the IPv4 host table (`mlxsw_host4`), like so:

```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ping 100.65.1.2
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4
pci/0000:03:00.0:
  index 0
    match_value:
      type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1
      type field_exact header ipv4 field destination ip value 100.65.1.2
    action_value:
      type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10
```

To decipher what the switch is doing: if the ifindex is `71` (which corresponds to `swp31`), and the IPv4 destination IP address is `100.65.1.2`, then the destination MAC address will be set to `b4:96:91:b3:b1:10`, so the switch knows where to send this ethernet datagram.

And now I have found what I need to know to be able to answer the question of the FIB size. This switch can take **92K IPv4 routes** and **31.5K IPv6 routes**, and I can even inspect the FIB in great detail. Rock on!

#### 3. devlink port split

But reading the switch chip configuration and FIB is not all that `devlink` can do, it can also make changes! One particularly interesting one is the ability to split and unsplit ports. What this means is that a 100Gbit port is internally divided into four so-called _lanes_ of 25Gbit each, while a 40Gbit port is internally divided into four _lanes_ of 10Gbit each. Splitting ports is the act of taking such a port and reconfiguring its lanes. Let me show you, by means of example, what splitting the first two switchports might look like. They begin their life as 100G ports, which support a number of link speeds, notably: 100G, 50G, 25G, but also 40G, 10G, and finally 1G:

```
root@fafo:~# ethtool swp1
Settings for swp1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
root@fafo:~# devlink port show | grep 'swp[12] '
pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4
pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4
root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4
[ 629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down
[ 629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down
[ 630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726)
[ 630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2
[ 630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2
root@fafo:~# ethtool swp1s0
Settings for swp1s0:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseSR/Full
```

Whoa, what just happened here? The switch took the port defined by `pci/0000:03:00.0/61`, which says it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`, and the resulting ports are 25G, 10G or 1G.

{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

However, I make an important observation.
When splitting `swp1` in 4, the switch also removed port `swp2`. Remember, at the beginning of this article I mentioned that the MAC addresses seemed to skip one entry between subsequent interfaces? Now I understand why: when splitting the port into two, it will use the second MAC address for the second 50G port; but if I split it into four, it'll use the MAC addresses from the adjacent port and decommission that port. In other words: this switch can do 32x100G, or 64x50G, or 64x25G/10G/1G, and it doesn't matter which of the PCI interfaces I split.

The operation is also reversible: I can issue `devlink port unsplit` to return the port to its aggregate state (eg. 4 lanes and 100Gbit), which will remove the `swp1s0-3` ports and put back `swp1` and `swp2` again. What I find particularly impressive is that for most hardware vendors, this splitting of ports requires a reboot of the chassis, while here it can happen entirely online. Well done, Mellanox!

## Performance

OK, so this all seems to work, but does it work _well_? If you're a reader of my blog, you'll know that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the switch on ports 31 and 32:

```
[ 1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[ 1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
[ 7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
```

I hope you're hungry, Hippo, cuz you're about to get fed!

### Debian SN2700: L2

To use the switch in L2 mode, I intuitively create a Linux bridge, say `br0`, and add ports to that. From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one another, but there can only be _one_ such bridge with `vlan_filtering` set. VLAN filtering allows the switch to only accept tagged frames from a list of _configured_ VLANs, and drop the rest. This is what you'd imagine a regular commercial switch would provide. So off I go, creating the bridge to which I'll add two ports (HundredGigabitEthernet ports swp31 and swp32), and I will allow for the maximum MTU size of 9216, also known as [[Jumbo Frames](https://en.wikipedia.org/wiki/Jumbo_frame)]:

```
root@fafo:~# ip link add name br0 type bridge
root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up
root@fafo:~# ip link set swp31 mtu 9216 master br0 up
root@fafo:~# ip link set swp32 mtu 9216 master br0 up
```

These two ports are now _access_ ports, that is to say they accept and emit only untagged traffic, and due to the `vlan_filtering` flag, they will drop all other frames. Using the standard `bridge` utility from Linux, I can manipulate the VLANs on these ports. First, I'll remove the _default VLAN_ and add VLAN 1234 to both ports, specifying that VLAN 1234 is the so-called _Port VLAN ID_ (pvid).
This makes them the equivalent of Cisco's _switchport access 1234_:

```
root@fafo:~# bridge vlan del vid 1 dev swp1
root@fafo:~# bridge vlan del vid 1 dev swp2
root@fafo:~# bridge vlan add vid 1234 dev swp1 pvid
root@fafo:~# bridge vlan add vid 1234 dev swp2 pvid
```

Then, I'll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a _trunk_ port, allowing these tagged VLANs and assuming untagged traffic is still VLAN 1234:

```
root@fafo:~# for port in swp1 swp2; do for vlan in 100 200 300 400; do \
  bridge vlan add vid $vlan dev $port; done; done
root@fafo:~# bridge vlan
port              vlan-id
swp1              100
                  200
                  300
                  400
                  1234 PVID
swp2              100
                  200
                  300
                  400
                  1234 PVID
br0               1 PVID Egress Untagged
```

When these commands are run against the interfaces `swp*`, they are picked up by the `mlxsw` kernel driver and transmitted to the _Spectrum_ switch chip; in other words, these commands end up programming the silicon. Traffic through these switch ports on the front rarely (if ever) gets forwarded to the Linux kernel: very similar to [[VPP](https://fd.io/)], the traffic stays mostly in the dataplane. Some traffic, such as LLDP (and as we'll see later, IPv4 ARP and IPv6 neighbor discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the results are transmitted back via PCIe to program the switch chip's L2/L3 _Forwarding Information Base_ (FIB).

Now I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine:

{{< image src="/assets/mlxsw/trex1.png" alt="Trex L2" >}}

At this point I can already conclude that this is all happening in the dataplane: the _Spectrum_ switch is connected to the Debian machine over a modest PCIe link (16 Gb/s, limited by the 5.0 GT/s x4 link shown in the dmesg below, and shared behind another device on the PCIe bus), so the Debian kernel is in no way able to process more than a token amount of traffic, and yet I'm seeing 100Gbit go through the switch chip while the CPU load on the kernel stays pretty much zero. I can however retrieve the link statistics using `ip stats`, and those will show me the actual counters of the silicon, not _just_ the trapped packets. If you'll recall, in VPP the only packets that the TAP interfaces see are those packets that are _punted_, and the Linux kernel there is completely oblivious to the total dataplane throughput. Here, the interface is showing the correct dataplane packet and byte counters, which means that things like SNMP will automatically just do the right thing.

```
root@fafo:~# dmesg | grep 03:00.*bandwidth
[ 2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
root@fafo:~# uptime
 03:19:16 up 2 days, 14:14,  1 user,  load average: 0.00, 0.00, 0.00
root@fafo:~# ip stats show dev swp32 group link
72: swp32: group link
    RX:  bytes packets errors dropped  missed   mcast
    5106713943502 15175926564      0       0       0     103
    TX:  bytes packets errors dropped carrier collsns
    23464859508367 103495791750      0       0       0       0
```

### Debian SN2700: IPv4 and IPv6

I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge `br0`, which will return the enslaved switchports.
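Tearing that down is just as unceremonious as setting it up. A minimal sketch, assuming nothing else references `br0`:

```
root@fafo:~# ip link set swp31 nomaster
root@fafo:~# ip link set swp32 nomaster
root@fafo:~# ip link del dev br0
```

Deleting `br0` on its own would also release any remaining member ports; the explicit `nomaster` just makes the intent obvious, and leaves `swp31` and `swp32` as plain standalone netdevs ready for L3 duty.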
I then convert the T-Rex loadtester to use an L3 profile, and configure the switch as follows:

```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31
root@fafo:~# ip addr add 100.65.2.1/30 dev swp32
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32
```

Several other routers I've loadtested have the same (cosmetic) issue that T-Rex doesn't reply to ARP packets after the first few seconds, so I first set the IPv4 address, then add a static L2 adjacency for the T-Rex side (on MAC `b4:96:91:b3:b1:10`), and finally route 16.0.0.0/8 to port 0 and 48.0.0.0/8 to port 1 of the loadtester.

{{< image src="/assets/mlxsw/trex2.png" alt="Trex L3" >}}

I start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up just fine. Taking a closer look at the `ip stats` instrumentation, I see that there's the ability to turn on L3 counters in addition to the L2 (ethernet) counters. So I do that on my two router ports while they are happily forwarding 58.9Mpps, and I can now see the difference between the dataplane (forwarded in hardware) and the CPU path (forwarded by the kernel):

```
root@fafo:~# ip stats set dev swp31 l3_stats on
root@fafo:~# ip stats set dev swp32 l3_stats on
root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats
72: swp32: group offload subgroup l3_stats on used on
    RX:  bytes packets errors dropped   mcast
    270222574848200 1137559577576      0       0       0
    TX:  bytes packets errors dropped
    281073635911430 1196677185749      0       0
root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit
72: swp32: group offload subgroup cpu_hit
    RX:  bytes packets errors dropped  missed   mcast
    1068742   17810      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
    468546    2191      0       0       0       0
```

The statistics above clearly demonstrate that the lion's share of the packets has been forwarded by the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course any traffic _to_ the IP addresses configured on the router) will go to the kernel.

### Debian SN2700: BVI (or VLAN Interfaces)

I've played around a little bit with L2 (switch) and L3 (router) ports, but there is one middle ground. I'll keep the T-Rex loadtest running in L3 mode, but now I'll reconfigure the switch to put the ports back into the bridge, each port in its own VLAN, and have a so-called _Bridge Virtual Interface_ (BVI), also known as a VLAN interface -- this is where the switch has a bunch of ports together in a VLAN, but the switch itself has an IPv4 or IPv6 address in that VLAN as well, so it can act as a router.
I reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move the IPv4 addresses and routes there -- so here I go, first putting the switch interfaces back into L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports:

```
root@fafo:~# ip link add name br0 type bridge vlan_filtering 1
root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up
root@fafo:~# ip link set swp31 master br0 mtu 9216 up
root@fafo:~# ip link set swp32 master br0 mtu 9216 up
root@fafo:~# bridge vlan del vid 1 dev swp31
root@fafo:~# bridge vlan del vid 1 dev swp32
root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid
root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid
```

From the ASIC specs, I understand that these BVIs need to (re)use a MAC from one of the members, so the first thing I do is give `br0` the right MAC address. Then I put the switch ports into the bridge, remove VLAN 1, and put them in their respective VLANs. At this point, the loadtester reports 100% packet loss, because the two ports can no longer see each other at layer 2, and the layer 3 configs have been removed. But I can restore connectivity with two _BVIs_ as follows:

```
root@fafo:~# for vlan in 1000 2000; do
  ip link add link br0 name br0.$vlan type vlan id $vlan
  bridge vlan add dev br0 vid $vlan self
  ip link set br0.$vlan up mtu 9216
done
root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000
root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000
```

And with that, the loadtest shoots back into action:

{{< image src="/assets/mlxsw/trex3.png" alt="Trex L3 BVI" >}}

First, a quick overview of the situation I have created:

```
root@fafo:~# bridge vlan
port              vlan-id
swp31             1000 PVID
swp32             2000 PVID
br0               1 PVID Egress Untagged

root@fafo:~# ip -4 ro
default via 198.19.5.1 dev eth0 onlink rt_trap
16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload
48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload
100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload
100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload
198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap

root@fafo:~# ip -4 nei
198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE
100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT
100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT
```

Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important difference. Now, the switch can have any number of ports in VLAN 1000, which will all amongst themselves do L2 forwarding at line rate, and when they need to send IPv4 traffic out, they will ARP for the gateway (for example at `100.65.1.1/24`), which will get _trapped_ and forwarded to the CPU, after which the ARP reply will go out so that the machines know where to find the gateway. From that point on, IPv4 forwarding happens once again in hardware, as shown by the `rt_offload` keyword in the routing table (`br0`, in the ASIC), compared to `rt_trap` (`eth0`, in the kernel). Similarly for the IPv4 neighbors, the L2 adjacency is programmed into the CAM (the output of which I took a look at above), so forwarding can be done directly by the ASIC without intervention from the CPU.
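One more way to convince myself that the L2 side of this construct is offloaded as well: dynamically learned MAC addresses should show up in the bridge FDB carrying the same `offload` flag as the routes and neighbors above. A quick sketch, assuming the standard `bridge` utility from iproute2:

```
# dynamically learned entries on the hardware ports should carry the 'offload' flag
root@fafo:~# bridge fdb show br br0 | grep -v permanent
```

Entries learned by the ASIC on `swp31` and `swp32` are reflected into the kernel's FDB, so the usual Linux tooling keeps giving an accurate view of what the silicon is doing.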
As a result, these _VLAN Interfaces_ (which are synonymous with _BVIs_) work at line rate out of the box.

### Results

This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their `mlxsw` switchdev implementation. I have in my hands a very affordable 32x100G, or 64x(50G, 25G, 10G, 1G), and anything in between, with IPv4 and IPv6 forwarding in hardware, with a limited FIB size, not too dissimilar from the [[Centec]({{< ref "2022-12-09-oem-switch-2" >}})] switches that IPng Networks runs in its AS8298 network, albeit without MPLS forwarding capabilities. Still, for a LAB switch, to better test 25G and 100G topologies, this switch is very good value for the money spent, and it runs Debian and is fully configurable with things like Kees and Ansible. Considering there's a whole range of 48x10G and 48x25G switches from Mellanox as well, all completely open and _officially allowed_ to run OSS stuff on, these make a perfect fit for IPng Networks!

### Acknowledgements

This article was written after fussing around and finding out, but a few references were particularly helpful, and I'd like to acknowledge the following super useful sites:

* [[mlxsw wiki](https://github.com/Mellanox/mlxsw/wiki)] on GitHub
* [[jpirko's kernel driver](https://github.com/jpirko/linux_mlxsw)] on GitHub
* [[SONiC wiki](https://github.com/sonic-net/SONiC/wiki/)] on GitHub
* [[Spectrum Docs](https://www.nvidia.com/en-us/networking/ethernet-switching/spectrum-sn2000/)] on NVIDIA

And to the community for writing and maintaining this excellent switchdev implementation.