--- date: "2022-12-05T11:56:54Z" title: 'Review: S5648X-2Q4Z Switch - Part 1: VxLAN/GENEVE/NvGRE' aliases: - /s/articles/2022/12/05/oem-switch-1.html --- After receiving an e-mail from a newer [[China based switch OEM](https://starry-networks.com/)], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup, notably an in-depth review of the [[S5860-20SQ]({{< ref "2021-08-07-fs-switch" >}})] which sports 20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and 8x100G cages. I use them in production at IPng Networks and their featureset versus price point is pretty good. In that article, I made one critical note reviewing those FS switches, in that they'e be a better fit if they allowed for MPLS or IP based L2VPN services in hardware. {{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}} {{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}

I got cautiously enthusiastic (albeit suitably skeptical) when this new vendor claimed VxLAN,
GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget (sub-$4K for the 56
port switch, and sub-$2K for the 26 port one). This reseller is using a lesser known silicon vendor
called [[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet silicon. In this
device, the CTC8096 (GoldenGate) is used for cost effective, high density 10GbE/40GbE applications
paired with 4x100GbE uplink capability. This is Centec's fourth generation, and the CTC8096 extends
the feature set from L2/L3 switching to advanced datacenter and metro Ethernet features. The switch
chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its
predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlight
features include telemetry, programmability, security, traffic management, and network time
synchronization.

This will be the first of a set of write-ups exploring the hard- and software functionality of this
new vendor. As we'll see, it's all about the _software_.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}

The switch comes well packaged with two removable 400W Gold power supplies from _Compuware
Technology_, which output 12V/33A and +5V/3A, as well as four removable PWM controlled fans from
_Protechnic_. The fans are expelling air, so they are cooling front-to-back on this unit. Looking
at the fans, changing them to pull air back-to-front would be possible after purchase, by flipping
the fans around, as they're attached in their case by two M4 flat-head screws. This is truly meant
to be an OEM switch -- there is no logo or sticker with the vendor's name, so I should probably
print a few vinyl IPng stickers to skin them later.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-daughterboard.png" alt="S5648 Daughterboard" >}}

On the front, the switch sports an RJ45 standard serial console, a mini-USB connector whose
function is not clear to me, an RJ45 network port used for management, a pinhole which houses a
reset button labeled `RST`, and two LED indicators labeled `ID` and `SYS`. The serial port runs at
115200,8n1 and the management network port is Gigabit. Regarding the regular switch ports, there
are 48x SFP+ cages, 4x QSFP28 ports (49-52) running at 100Gbit, and 2x QSFP+ ports (53-54) running
at 40Gbit. All ports (management and switch) present a MAC address from OUI `00-1E-08`, which is
assigned to Centec.

The switch is not particularly quiet: its six fans start up at a high pitch, but once the switch
boots they calm down and emit noise levels as you would expect from a datacenter unit. I measured
74dBA when booting, and around 62dBA when running.

On the inside, the PCB is rather clean. It comes with a daughterboard housing a small PowerPC P1010
with a 533MHz CPU, 1GB of RAM and 2GB of flash on board, which is running Linux. This is the same
card that many of the FS.com switches use (eg. the S5860-48S6Q), a cheaper alternative to the high
end Intel Xeon-D.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-switchchip-block.png" alt="S5648 Switchchip" >}}

#### S5648X (48x10, 2x40, 4x100)

There is one switch chip, on the front of the PCB, connecting all 54 ports. It has a sizable
heatsink on it, drawing air backwards through ports (36-48).
The switch uses a less well known and somewhat dated Centec
[[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)], codenamed GoldenGate
and released in 2015, which is rated for 1.2Tbps of aggregate throughput. The chip can be
programmed to handle a bunch of SDN protocols, including VxLAN, GRE, GENEVE, and MPLS / MPLS SR,
with a limited TCAM to hold things like ACLs, IPv4/IPv6 routes and MPLS labels. The CTC8096
provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports. The SerDES design is
pretty flexible, allowing it to mix and match ports.

You can see more (hires) pictures and screenshots throughout these articles in this
[[Photo Album](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)].

#### S5624X (24x10, 2x100)

In case you're curious (I certainly was!), the smaller unit (with 24x10+2x100) is built off of the
Centec [[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)], codenamed
TsingMa and released in 2019, which offers a similar variety of features, including L2, L3, MPLS,
VXLAN, MPLS SR, and OAM/APS. Highlight features include telemetry, programmability, security,
traffic management, and network time synchronization. The SoC has an embedded ARM A53 CPU core
running at 800MHz, and the SerDES on this chip allows for 24x1G/2.5G/5G/10G and 2x40G/100G, for a
throughput of 440Gbps at a fairly sharp price point.

One thing worth noting (because I know some of my regular readers will already be wondering!):
this series of chips (both the fourth generation CTC8096 and the sixth generation CTC7132) comes
with a very modest TCAM, which means in practice 32K MAC addresses, 8K IPv4 routes, 1K IPv6 routes,
a 6K MPLS table, 1K L2VPN instances, and 64 VPLS instances. The Centec also comes with a modest
32MB packet buffer shared between all ports, and the controlplane has 1GB of memory and a 533MHz
CPU. So no, this won't take a full table :-) but in all honesty, that's not the thing this machine
is built to do.

When booted, the switch draws roughly 68 Watts combined on its two power supplies, and I find that
pretty cool considering the total throughput offered. Of course, once optics are inserted, the
total power draw will go up. Also worth noting: when the switch is under load, the Centec chip
consumes more power. For example, when forwarding 8x10G + 2x100G, the total consumption was 88
Watts, totally respectable now that datacenter power bills are skyrocketing.
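Out of curiosity, here's a quick back-of-the-envelope on what those two measurements mean for the
yearly power bill. The wattages are the ones I measured above; the per-kWh price is purely an
assumption of mine, so plug in your own tariff:

```
# Rough yearly energy use and cost for one switch, using the draw measured above.
# The electricity price is an assumption; adjust it to your own tariff.
PRICE_PER_KWH = 0.30  # assumed price, in CHF/kWh

for label, watts in [("idle, no optics", 68), ("forwarding 8x10G + 2x100G", 88)]:
    kwh_per_year = watts * 24 * 365 / 1000
    cost = kwh_per_year * PRICE_PER_KWH
    print(f"{label:>28}: {kwh_per_year:6.0f} kWh/year, ~CHF {cost:.0f}/year")
```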
### Topology

{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}

On the heels of my [[DENOG14 Talk](/media/denog14/index.html)], in which I showed how VPP can
_route_ 150Mpps and 180Gbps on a 10 year old Dell while consuming a full 1M BGP table in 7 seconds
or so, I still had a little bit of a LAB left to repurpose. So I built the following topology using
the loadtester, packet analyzer, and switches:

* **msw-top**: S5624-2Z-EI switch
* **msw-core**: S5648X-2Q4ZA switch
* **msw-bottom**: S5624-2Z-EI switch
* All switches connect to:
  * each other with 100G DACs (right, black)
  * the T-Rex machine with 4x10G (left, rainbow)
* Each switch gets a mgmt IPv4 and IPv6

With this topology I will have enough wiggle room to patch anything to anything. Now that the
physical part is out of the way, let's take a look at the firmware of these things!

### Software

As can be seen in the topology above, I am testing three of these switches - two are the smaller
sibling [S5624X 2Z-EI] (which come with 24x10G SFP+ and 2x100G QSFP28), and one is this [S5648X
2Q4Z], pictured above. The vendor has a licensing system for basic L2, basic L3 and advanced metro
L3. These switches come with the most advanced/liberal licenses, which means all of the features
will work on them, notably MPLS/LDP and VPWS/VPLS.

Taking a look at the CLI, it's very Cisco IOS-esque; there are a few small differences, but the
look and feel is definitely familiar. Base configuration kind of looks like this:

#### Basic config

```
msw-core# show running-config
management ip address 192.168.1.33/24
management route add gateway 192.168.1.252
!
ntp server 216.239.35.4
ntp server 216.239.35.8
ntp mgmt-if enable
!
snmp-server enable
snmp-server system-contact noc@ipng.ch
snmp-server system-location Bruttisellen, Switzerland
snmp-server community public read-only
snmp-server version v2c

msw-core# conf t
msw-core(config)# stm prefer ipran
```

A few small things of note. There is no `mgmt0` device as I would've expected. Instead, the SoC
exposes its management interface to be configured with these `management ...` commands. The IPv4
address can be either DHCP or static, and IPv6 can only do static addresses. Only one (default)
gateway can be set for either protocol. NTP can be set up to work over the `mgmt-if`, which is a
useful way to keep time. The SNMP server works both from the `mgmt-if` and from the dataplane,
which is nice. SNMP supports everything you'd expect, including v3 and traps for all sorts of
events, including IPv6 targets and either dataplane or `mgmt-if`. I did notice that the nameserver
cannot use the `mgmt-if`, so I left it unconfigured; I found that a little bit odd, considering all
the other functionality works just fine over the `mgmt-if`.

If you've run CAM-based systems before, you'll likely have come across some form of _partitioning_
mechanism, to allow certain types in the CAM (eg. IPv4, IPv6, L2 MACs, MPLS labels, ACLs) to have
more or fewer entries. This is particularly relevant on this switch because it has a comparatively
small CAM. It turns out that, by default, MPLS is entirely disabled, and to turn it on (and
sacrifice some of that sweet, sweet content addressable memory), I have to issue the command
`stm prefer ipran` (other flavors are _ipv6_, _layer3_, _ptn_, and _default_), and reload the
switch.

Having been in the networking industry for a while, I scratched my head at the acronym **IPRAN**,
so I will admit having to look it up. It's a general term used to describe an IP based Radio Access
Network (2G, 3G, 4G or 5G) which uses IP as a transport layer technology. I find it funny, in a
twisted sort of way, that to get the oldskool MPLS service, I have to turn on IPRAN. Anyway, after
changing the STM profile to _ipran_, the following partition is available:

**IPRAN** CAM    | S5648X (msw-core)          | S5624 (msw-top & msw-bottom)
---------------- | -------------------------- | ----------------------------
MAC Addresses    | 32k                        | 98k
IPv4 routes      | host: 4k, indirect: 8k     | host: 12k, indirect: 56k
IPv6 routes      | host: 512, indirect: 512   | host: 2048, indirect: 1024
MPLS labels      | 6656                       | 6144
VPWS instances   | 1024                       | 1024
VPLS instances   | 64                         | 64
Port ACL entries | ingress: 1927, egress: 176 | ingress: 2976, egress: 928
VLAN ACL entries | ingress: 256, egress: 32   | ingress: 256, egress: 64

First off: there are quite a few differences here!
The big switch has relatively few MAC addresses, IPv4 and IPv6 routes, compared to the little ones.
But, it has a few more MPLS labels. ACL-wise, the small switch once again has a bit more capacity.
But, of course, the large switch has lots more ports (56 versus 26), and is more expensive. Choose
wisely :)

Regarding IPv4/IPv6 and MPLS space, luckily [[AS8298]({{< ref "2021-02-27-network" >}})] is
relatively compact in its IGP. As of today, it carries 41 IPv4 and 48 IPv6 prefixes in OSPF, which
means that these switches would be fine participating in Area 0. If CAM space does turn into an
issue down the line, I can put them in stub areas and advertise only a default. As an aside, VPP
doesn't have any CAM at all, so for my routers the size is basically governed by system memory
(which on modern computers equals "infinite routes"). As long as I keep it out of the DFZ, this
switch should be fine, for example in a BGP-free core that switches traffic based on VxLAN or MPLS,
but I digress.

#### L2

First let's test a straightforward configuration:

```
msw-top# configure terminal
msw-top(config)# vlan database
msw-top(config-vlan)# vlan 5-8
msw-top(config-vlan)# interface eth-0-1
msw-top(config-if)# switchport access vlan 5
msw-top(config-vlan)# interface eth-0-2
msw-top(config-if)# switchport access vlan 6
msw-top(config-vlan)# interface eth-0-3
msw-top(config-if)# switchport access vlan 7
msw-top(config-vlan)# interface eth-0-4
msw-top(config-if)# switchport mode dot1q-tunnel
msw-top(config-if)# switchport dot1q-tunnel native vlan 8
msw-top(config-vlan)# interface eth-0-26
msw-top(config-if)# switchport mode trunk
msw-top(config-if)# switchport trunk allowed vlan only 5-8
```

By means of demonstration, I created port `eth-0-4` as a QinQ capable port, which means that any
untagged frames coming into it will become VLAN 8, but any tagged frames will get s-tag 8 and a
c-tag with whatever tag was sent; in other words, standard issue QinQ tunneling. The configuration
of `msw-bottom` is exactly the same, and because we're connecting these VLANs through `msw-core`,
I'll have to make it a member of all these interfaces using the `interface range` shortcut:

```
msw-core# configure terminal
msw-core(config)# vlan database
msw-core(config-vlan)# vlan 5-8
msw-core(config-vlan)# interface range eth-0-49 - 50
msw-core(config-if)# switchport mode trunk
msw-core(config-if)# switchport trunk allowed vlan only 5-8
```

The loadtest results in T-Rex are, quite unsurprisingly, line rate. In the screenshot below, I'm
sending 128 byte frames at 8x10G (40G from `msw-top` through `msw-core` and out `msw-bottom`, and
40G in the other direction):

{{< image src="/assets/oem-switch/l2-trex.png" alt="L2 T-Rex" >}}

A few notes, for critical observers:

* I have to use 128 byte frames because the T-Rex loadtester is armed with 3x Intel x710 NICs,
  which have a total packet rate of only 40Mpps. Intel made these with LACP redundancy in mind,
  and does not recommend fully loading them. As 64b frames would be ~59.52Mpps, the NIC won't keep
  up. So, I let T-Rex send 128b frames, which is ~33.8Mpps (the arithmetic behind these numbers is
  worked out in the sketch below).
* T-Rex shows only the first 4 ports in detail, and you can see all four ports are sending 10Gbps
  of L1 traffic, which at this frame size is 8.66Gbps of ethernet (as each frame also has a 24 byte
  overhead [[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). We can clearly see, though, that
  all Tx packets/sec are also Rx packets/sec, which means all traffic is safely accounted for.
* In the top panel, you will see not 4x10, but **8x10Gbps and 67.62Mpps** of total throughput, with
  no traffic lost, and the loadtester CPU well within limits: 👍

```
msw-top# show int summary | exc DOWN
RXBS: rx rate (bits/sec)      RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec)      TXPS: tx rate (pkts/sec)

Interface         Link   RXBS          RXPS       TXBS          TXPS
-----------------------------------------------------------------------------
eth-0-1           UP     10016060422   8459510    10016060652   8459510
eth-0-2           UP     10016080176   8459527    10016079835   8459526
eth-0-3           UP     10015294254   8458863    10015294258   8458863
eth-0-4           UP     10016083019   8459529    10016083126   8459529
eth-0-25          UP     449           0          501           0
eth-0-26          UP     41362394687   33837608   41362394527   33837608
```

Clearly, all three switches are happy to forward 40Gbps in both directions, and the 100G port is
happy to forward (at least) 40G symmetric. And because the uplink port is trunked, each ethernet
frame will be 4 bytes longer due to the dot1q tag, which at 128b frames means we'll be using
132/128 * 4 * 10G == 41.3G of traffic, which is spot on.
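For readers who want to double-check that arithmetic (and the Mpps numbers in the notes above),
here's a quick standalone Python sketch. It assumes nothing about the switch, only the usual 20
bytes of preamble, start-of-frame delimiter and inter-frame gap that every ethernet frame costs on
the wire:

```
# Sanity check of the loadtest numbers above. On the wire, every ethernet frame
# costs an extra 20 bytes (7B preamble + 1B SFD + 12B inter-frame gap); the 4-byte
# FCS is already counted in the 64/128 byte frame sizes used here.
L1_OVERHEAD = 20

def line_rate_pps(frame_bytes: int, link_bps: float = 10e9) -> float:
    """Packets per second at line rate for a given L2 frame size."""
    return link_bps / ((frame_bytes + L1_OVERHEAD) * 8)

print(f"4x10G at  64b: {4 * line_rate_pps(64) / 1e6:.2f} Mpps")    # ~59.52 Mpps
print(f"4x10G at 128b: {4 * line_rate_pps(128) / 1e6:.2f} Mpps")   # ~33.78 Mpps

# Ethernet (L2) throughput of one 10G port at 128b frames
print(f"L2 rate at 128b: {128 / (128 + L1_OVERHEAD) * 10:.2f} Gbps")  # ~8.65 Gbps, the ~8.66G T-Rex shows

# Trunk uplink: each frame grows by a 4-byte dot1q tag
print(f"dot1q uplink: {132 / 128 * 4 * 10:.2f} Gbps")               # 41.25 Gbps, the ~41.3G on eth-0-26
```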
#### L3

For this test, I will reconfigure the 100G ports to become routed rather than switched. Remember,
`msw-top` connects to `msw-core`, which in turn connects to `msw-bottom`, so I'll need two IPv4 /31
and two IPv6 /64 transit networks. I'll also create a loopback interface with a stable IPv4 and
IPv6 address on each switch, and I'll tie all of these together in IPv4 and IPv6 OSPF in Area 0.
The configuration for the `msw-top` switch becomes:

```
msw-top# configure terminal
interface loopback0
 ip address 172.20.0.2/32
 ipv6 address 2001:678:d78:400::2/128
 ipv6 router ospf 8298 area 0
!
interface eth-0-26
 description Core: msw-core eth-0-49
 speed 100G
 no switchport
 mtu 9216
 ip address 172.20.0.11/31
 ipv6 address 2001:678:d78:400::2:2/112
 ip ospf network point-to-point
 ip ospf cost 1004
 ipv6 ospf network point-to-point
 ipv6 ospf cost 1006
 ipv6 router ospf 8298 area 0
!
router ospf 8298
 router-id 172.20.0.2
 network 172.20.0.0/22 area 0
 redistribute static
!
router ipv6 ospf 8298
 router-id 172.20.0.2
 redistribute static
```

Now that the IGP is up for IPv4 and IPv6 and I can ping the loopbacks from any switch to any other
switch, I can continue with the loadtest. I'll configure four IPv4 interfaces, and a static route
towards the loadtester behind each of them:

```
msw-top# configure terminal
interface eth-0-1
 no switchport
 ip address 100.65.1.1/30
!
interface eth-0-2
 no switchport
 ip address 100.65.2.1/30
!
interface eth-0-3
 no switchport
 ip address 100.65.3.1/30
!
interface eth-0-4
 no switchport
 ip address 100.65.4.1/30
!
ip route 16.0.1.0/24 100.65.1.2
ip route 16.0.2.0/24 100.65.2.2
ip route 16.0.3.0/24 100.65.3.2
ip route 16.0.4.0/24 100.65.4.2
```

After which I can see these transit networks and static routes propagate, through `msw-core`, and
into `msw-bottom`:

```
msw-bottom# show ip route
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       Dc - DHCP Client
       [*] - [AD/Metric]
       * - candidate default

O   16.0.1.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.2.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.3.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.4.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.1.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.2.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.3.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.4.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C   172.20.0.0/32 is directly connected, loopback0
O   172.20.0.1/32 [110/1005] via 172.20.0.9, eth-0-26, 05:50:48
O   172.20.0.2/32 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C   172.20.0.8/31 is directly connected, eth-0-26
C   172.20.0.8/32 is in local loopback, eth-0-26
O   172.20.0.10/31 [110/1018] via 172.20.0.9, eth-0-26, 05:50:48
```

I now instruct the T-Rex loadtester to send single-flow loadtest traffic from 16.0.X.1 -> 48.0.X.1
on port 0, and back from 48.0.X.1 -> 16.0.X.1 on port 1; for ports 2+3 I use X=2, for ports 4+5 I
use X=3, and for ports 6+7 I use X=4. After T-Rex starts up, it's sending 80Gbps of traffic with a
grand total of 67.6Mpps in 8 unique flows of 8.45Mpps at 128b each, and the three switches forward
this L3 IPv4 unicast traffic effortlessly:

{{< image src="/assets/oem-switch/l3-trex.png" alt="L3 T-Rex" >}}
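In case you'd like to reproduce this, the stream definitions are simple. Below is a sketch of what
one direction (port 0, X=1) looks like as a T-Rex stateless Python profile. It's an illustrative
reconstruction rather than my exact profile, and it assumes T-Rex's stateless API (`trex_stl_lib`)
plus scapy are available; the addresses match the static routes above and the UDP ports match what
shows up in the packet capture later on:

```
# trex_l3_port0.py -- illustrative single-flow profile for one direction (X=1).
# Assumes T-Rex's stateless Python API (trex_stl_lib) and scapy are installed.
from trex_stl_lib.api import STLStream, STLPktBuilder, STLTXCont
from scapy.layers.inet import IP, UDP
from scapy.layers.l2 import Ether

class SingleFlow:
    def get_streams(self, direction=0, **kwargs):
        # One fixed 5-tuple (hence "single-flow"): 16.0.1.1 -> 48.0.1.1,
        # padded out to a 128-byte packet, sent continuously at ~8.45 Mpps.
        base = Ether() / IP(src="16.0.1.1", dst="48.0.1.1") / UDP(sport=1025, dport=12)
        pad = b"\x00" * max(0, 128 - len(base))
        return [STLStream(packet=STLPktBuilder(pkt=base / pad),
                          mode=STLTXCont(pps=8.45e6))]

# T-Rex instantiates the profile through this hook.
def register():
    return SingleFlow()
```

Each of the eight port/direction combinations gets a stream like this with its own value of X,
which is exactly why there are only eight distinct flows in the whole test.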
#### Overlay

What I've built just now would be acceptable really only if the switches were in the same rack (or
at best, facility). As an industry professional, I frown upon things like _VLAN-stretching_, a term
that describes bridging VLANs between buildings (or, as some might admit to, between cities or even
countries 🤮). A long time ago (in December 1999), Luca Martini invented what is now called
[[Martini Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)],
defining how to transport Ethernet frames over an MPLS network, which is what I really want to
demonstrate, albeit in the next article.

What folks don't always realize is that the industry is _moving on_ from MPLS to a set of more
flexible IP based solutions, notably tunneling using IPv4 or IPv6 UDP packets such as found in
VxLAN or GENEVE, two of my favorite protocols. This certainly does cost a little bit in VPP, as I
wrote about in my post on [[VLLs in VPP]({{< ref "2022-01-12-vpp-l2" >}})], although you'd be
surprised how many VxLAN encapsulated packets/sec a simple AMD64 router can forward. With respect
to these switches, though, let's find out if tunneling this way incurs an overhead or performance
penalty. Ready? Let's go!

First I will put the first four interfaces in range `eth-0-1 - 4` into a new set of VLANs, but in
the VLAN database I will enable what is called _overlay_ on them:

```
msw-top# configure terminal
vlan database
 vlan 5-8,10,20,30,40
 vlan 10 name v-vxlan-xco10
 vlan 10 overlay enable
 vlan 20 name v-vxlan-xco20
 vlan 20 overlay enable
 vlan 30 name v-vxlan-xco30
 vlan 30 overlay enable
 vlan 40 name v-vxlan-xco40
 vlan 40 overlay enable
!
interface eth-0-1
 switchport access vlan 10
!
interface eth-0-2
 switchport access vlan 20
!
interface eth-0-3
 switchport access vlan 30
!
interface eth-0-4
 switchport access vlan 40
```

Next, I create two new loopback interfaces (bear with me on this one), and configure the transport
of these overlays in the switch. This configuration will pick up the VLANs and move them to remote
sites in either the VxLAN, GENEVE or NvGRE protocol, like this:

```
msw-top# configure terminal
!
interface loopback1
 ip address 172.20.1.2/32
!
interface loopback2
 ip address 172.20.2.2/32
!
overlay
 remote-vtep 1 ip-address 172.20.0.0 type vxlan src-ip 172.20.0.2
 remote-vtep 2 ip-address 172.20.1.0 type nvgre src-ip 172.20.1.2
 remote-vtep 3 ip-address 172.20.2.0 type geneve src-ip 172.20.2.2 keep-vlan-tag
 vlan 10 vni 829810
 vlan 10 remote-vtep 1
 vlan 20 vni 829820
 vlan 20 remote-vtep 2
 vlan 30 vni 829830
 vlan 30 remote-vtep 3
 vlan 40 vni 829840
 vlan 40 remote-vtep 1
!
```

Alright, this is seriously cool! The first overlay defines what is called a remote `VTEP` (virtual
tunnel end point), of type `VxLAN`, towards IPv4 address `172.20.0.0`, coming from source address
`172.20.0.2` (which is our `loopback0` interface on switch `msw-top`). As it turns out, I am not
allowed to create different overlay _types_ to the same _destination_ address, but not to worry: I
can create a few unique loopback interfaces with unique IPv4 addresses (see `loopback1` and
`loopback2`), and create new VTEPs using these. So, the VTEP at index 2 is of type `NvGRE` and the
one at index 3 is of type `GENEVE`, and due to the use of `keep-vlan-tag`, the encapsulated traffic
will carry dot1q tags, whereas in the other two VTEPs the tag will be stripped and what is
transported on the wire is untagged traffic.
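To picture what this does to a frame, here's a small scapy sketch (my own illustration, not
anything generated by the switch) of roughly what a customer frame in VLAN 10 should look like once
`msw-top` has wrapped it towards remote VTEP 1. The addresses and VNI are taken from the
configuration above; the outer source port is simply the fixed one I later observe in the capture:

```
# Sketch of a VxLAN-encapsulated frame as msw-top should emit it for VLAN 10.
from scapy.layers.inet import IP, UDP
from scapy.layers.l2 import Ether
from scapy.layers.vxlan import VXLAN

# Inner frame: whatever the host behind eth-0-1 sends into VLAN 10 (illustrative).
inner = (Ether(src="68:05:ca:32:45:94", dst="68:05:ca:32:45:95")
         / IP(src="16.0.1.47", dst="48.0.1.47") / UDP(sport=1025, dport=12))

# Outer headers: loopback0 of msw-top towards remote-vtep 1, UDP/4789, VNI 829810.
outer = (Ether() / IP(src="172.20.0.2", dst="172.20.0.0")
         / UDP(sport=49208, dport=4789) / VXLAN(vni=829810, flags=0x08))  # 'I' bit set

(outer / inner).show()   # prints the full header stack, with the inner frame as payload
```

On the switch itself, these VTEPs then show up in the VLAN database and the MAC address table: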
```
msw-top# show vlan all
VLAN ID  Name                            State    STP ID   Member ports
                                                           (u)-Untagged, (t)-Tagged
=======  =============================== =======  =======  ========================
(...)
10       v-vxlan-xco10                   ACTIVE   0        eth-0-1(u)
                                                           VxLAN: 172.20.0.2->172.20.0.0
20       v-vxlan-xco20                   ACTIVE   0        eth-0-2(u)
                                                           NvGRE: 172.20.1.2->172.20.1.0
30       v-vxlan-xco30                   ACTIVE   0        eth-0-3(u)
                                                           GENEVE: 172.20.2.2->172.20.2.0
40       v-vxlan-xco40                   ACTIVE   0        eth-0-4(u)
                                                           VxLAN: 172.20.0.2->172.20.0.0

msw-top# show mac address-table
          Mac Address Table
-------------------------------------------
(*)  - Security Entry       (M)  - MLAG Entry
(MO) - MLAG Output Entry    (MI) - MLAG Input Entry
(E)  - EVPN Entry           (EO) - EVPN Output Entry
(EI) - EVPN Input Entry

Vlan    Mac Address        Type       Ports
----    -----------        --------   -----
10      6805.ca32.4595     dynamic    VxLAN: 172.20.0.2->172.20.0.0
10      6805.ca32.4594     dynamic    eth-0-1
20      6805.ca32.4596     dynamic    NvGRE: 172.20.1.2->172.20.1.0
20      6805.ca32.4597     dynamic    eth-0-2
30      9c69.b461.7679     dynamic    GENEVE: 172.20.2.2->172.20.2.0
30      9c69.b461.7678     dynamic    eth-0-3
40      9c69.b461.767a     dynamic    VxLAN: 172.20.0.2->172.20.0.0
40      9c69.b461.767b     dynamic    eth-0-4
```

Turning my attention to the VLAN database, I can now see the power of this become obvious. This
switch has any number of local interfaces, either tagged or untagged (in the case of VLAN 10 we can
see `eth-0-1(u)`, which means that interface is participating in the VLAN untagged), but we can
also see that this VLAN 10 has a member port called `VxLAN: 172.20.0.2->172.20.0.0`. This port is
just like any other, in that it'll participate in unknown unicast, broadcast and multicast, and
"learn" MAC addresses behind these virtual overlay ports.

In VLAN 10 (and VLAN 40), I can see in the L2 FIB (`show mac address-table`) that there's a local
MAC address learned (from the T-Rex loadtester) behind `eth-0-1`, but there's also a remote MAC
address learned behind the VxLAN port. I'm impressed. I can add any number of VLANs (and
dot1q-tunnels) into a VTEP endpoint, after assigning each of them a unique `VNI` (virtual network
identifier). If you're curious about these, take a look at the
[[VxLAN](https://datatracker.ietf.org/doc/html/rfc7348)],
[[GENEVE](https://datatracker.ietf.org/doc/html/rfc8926)] and
[[NvGRE](https://www.rfc-editor.org/rfc/rfc7637.html)] specifications. Basically, the encapsulation
is just putting the ethernet frame as the payload of a UDP packet (or, in the case of NvGRE, a GRE
packet), so let's take a look at those.

#### Inspecting overlay

As you'll recall, the VLAN 10,20,30,40 traffic is now traveling over an IP network, notably
encapsulated by the source switch `msw-top` and delivered to `msw-bottom` via the IGP (in my case,
OSPF), while it transits through `msw-core`. I decide to take a look at this, by configuring a
monitor port on `msw-core`:

```
msw-core# show run | inc moni
monitor session 1 source interface eth-0-49 both
monitor session 1 destination interface eth-0-1
```

This will copy all in- and egress traffic from interface `eth-0-49` (connected to `msw-top`)
through to local interface `eth-0-1`, which is connected to the loadtester.
I can simply tcpdump this stuff:

```
pim@trex01:~$ sudo tcpdump -ni eno2 '(proto gre) or (udp and port 4789) or (udp and port 6081)'
01:26:24.685666 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
    (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
    172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829810
    68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.1.47.1025 > 16.0.1.47.12: UDP, length 82
01:26:24.688305 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype IPv4 (0x0800), length 166:
    (tos 0x0, ttl 128, id 44814, offset 0, flags [DF], proto GRE (47), length 152)
    172.20.1.2 > 172.20.1.0: GREv0, Flags [key present], key=0xca97c38, proto TEB (0x6558), length 132
    68:05:ca:32:45:97 > 68:05:ca:32:45:96, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.2.73.1025 > 16.0.2.73.12: UDP, length 82
01:26:24.689100 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 178:
    (tos 0x0, ttl 127, id 7502, offset 0, flags [DF], proto UDP (17), length 164)
    172.20.2.0.49208 > 172.20.2.2.6081: GENEVE, Flags [none], vni 0xca986, proto TEB (0x6558)
    9c:69:b4:61:76:79 > 9c:69:b4:61:76:78, ethertype 802.1Q (0x8100), length 128: vlan 30, p 0,
    ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.3.109.1025 > 16.0.3.109.12: UDP, length 82
01:26:24.701666 00:1e:08:0d:6e:89 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
    (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
    172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829840
    68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.4.47.1025 > 16.0.4.47.12: UDP, length 82
```

We can see packets for all four tunnels in this dump. The first one is a UDP packet to port 4789,
which is the standard port for VxLAN, and it has VNI 829810. The second packet is proto GRE with
flag `TEB`, which stands for _transparent ethernet bridge_, in other words an L2 variant of GRE
that carries ethernet frames. The third one shows that feature I configured above (in case you
forgot, it's the `keep-vlan-tag` option when creating the VTEP), and because of that flag we can
see that the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986`, which is
hex for `829830`. Finally, the fourth one shows VLAN 40 traffic that is sent to the same VTEP
endpoint as the VLAN 10 traffic (showing that multiple VLANs can be transported across the same
tunnel, distinguished by VNI).

{{< image width="90px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

At this point I make an important observation. VxLAN and GENEVE both have this really cool feature
that they can hash their _inner_ payload (ie. the IPv4/IPv6 addresses and ports, if available) and
use that to randomize the _source port_, which makes them preferable to GRE. The reason why this is
preferable is that the hashing makes unique inner flows become unique outer flows, which in turn
allows them to be loadbalanced in intermediate networks, but also in the receiver if it has
multiple receive queues.

However, and this is important: the switch does not hash, which means that all ethernet traffic in
the VxLAN, GENEVE and NvGRE tunnels always has the exact same outer header, so loadbalancing and
multiple receive queues are out of the question. I wonder if this is a limitation of the Centec
chip, or a failure to program or configure it by the firmware.
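To illustrate what I would have liked to see: RFC 7348 recommends deriving the outer UDP source
port from a hash over the inner headers, so that distinct inner flows turn into distinct outer
flows. A minimal sketch of that idea (my own illustration in Python, not the switch's or anyone
else's actual hash function) looks like this:

```
# Derive a VxLAN outer UDP source port from a hash of the inner flow, in the
# spirit of RFC 7348; the ephemeral port range 49152-65535 gives 16384 buckets.
import zlib

def vxlan_source_port(src_ip: str, dst_ip: str, sport: int, dport: int, proto: int = 17) -> int:
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}|{proto}".encode()
    return 49152 + (zlib.crc32(key) % 16384)

# Distinct inner flows hash to (almost always) distinct outer source ports, so
# ECMP in the underlay and RSS at the receiver can spread them. The switch
# instead uses one fixed source port, collapsing everything into a single flow.
print(vxlan_source_port("16.0.1.47", "48.0.1.47", 1025, 12))
print(vxlan_source_port("16.0.2.73", "48.0.2.73", 1025, 12))
```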
With that gripe out of the way, let's take a look at 80Gbit of tunneled traffic, shall we?

{{< image src="/assets/oem-switch/overlay-trex.png" alt="Overlay T-Rex" >}}

Once again, all three switches are acing it. So that's at least 40Gbps of encapsulation and 40Gbps
of decapsulation per switch, and the transport over IPv4 through the `msw-core` switch to the other
side is all in working order. On top of that, I've shown that multiple types of overlay can live
alongside one another, even between the same pair of switches, and that multiple VLANs can share
the same underlay transport. The only downside is the **single flow** nature of these UDP
transports. A final inspection of the switch throughput:

```
msw-top# show interface summary | exc DOWN
RXBS: rx rate (bits/sec)      RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec)      TXPS: tx rate (pkts/sec)

Interface         Link   RXBS          RXPS       TXBS          TXPS
-----------------------------------------------------------------------------
eth-0-1           UP     10013004482   8456929    10013004548   8456929
eth-0-2           UP     10013030687   8456951    10013030801   8456951
eth-0-3           UP     10012625863   8456609    10012626030   8456609
eth-0-4           UP     10013032737   8456953    10013034423   8456954
eth-0-25          UP     505           0          513           0
eth-0-26          UP     51147539721   33827761   51147540111   33827762
```

Take a look at that `eth-0-26` interface: it's using significantly more bandwidth (51Gbps) than the
sum of the four transports (4x10Gbps). This is because each ethernet frame (of 128b) has to be
wrapped in an IPv4 UDP packet (or, in the case of NvGRE, an IPv4 packet with a GRE header). For
VxLAN, for example, that adds 50 bytes to every frame (14 bytes of outer ethernet, 20 bytes of
IPv4, 8 bytes of UDP and an 8 byte VxLAN header), which incurs quite some overhead, for small
packets at least. But it definitely proves that the switches here are happy to do this forwarding
at line rate, and that's what counts!

### Conclusions

It's just super cool to see a switch like this work as expected. I did not manage to overload it at
all, neither with an IPv4 loadtest at 67Mpps and 80Gbit of traffic, nor with an L2 loadtest with
four ports transported over VxLAN, NvGRE and GENEVE, _at the same time_. Although the underlay can
only use IPv4 (no IPv6 is available in the switch chip), this is not a huge problem for me. At
AS8298, I can easily define some private VRF with IPv4 space from RFC1918 to do the transport of
traffic over VxLAN. And what's even better, this can perfectly inter-operate with my VPP routers,
which also do VxLAN en/decapsulation.

Now there is one more thing for me to test (and, cliffhanger, I've tested it already but I'll have
to write up all of my data and results ...). I need to do what I said I would do in the beginning
of this article, and what I had hoped to achieve with the FS switches but failed to due to lack of
support: MPLS L2VPN transport (and its more complex but cooler sibling, VPLS).