---
date: "2022-12-05T11:56:54Z"
title: 'Review: S5648X-2Q4Z Switch - Part 1: VxLAN/GENEVE/NvGRE'
aliases:
- /s/articles/2022/12/05/oem-switch-1.html
---

After receiving an e-mail from a newer [[China based switch OEM](https://starry-networks.com/)], I
had a chat with their founder and learned that the combination of switch silicon and software may be
a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup,
notably an in-depth review of the [[S5860-20SQ]({{< ref "2021-08-07-fs-switch" >}})] which sports
20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and
8x100G cages. I use them in production at IPng Networks and their featureset versus price point is
pretty good. In that article, I made one critical note about those FS switches: they'd be a better
fit if they allowed for MPLS or IP based L2VPN services in hardware.

{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}

{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}

<br /><br />

I got cautiously enthusiastic (albeit suitably skeptical) when this new vendor claimed VxLAN,
GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget (sub-$4K for the
56-port and sub-$2K for the 26-port switch). This reseller is using a lesser-known silicon vendor
called [[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet silicon. In this
device, the CTC8096 (GoldenGate) is used for cost-effective, high-density 10GbE/40GbE applications
paired with 4x100GbE uplink capability. This is Centec's fourth generation, so the CTC8096 inherits
a feature set spanning L2/L3 switching through advanced data center and metro Ethernet features,
with innovative enhancements. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or
80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2,
L3, MPLS, VxLAN, MPLS SR, and OAM/APS. Highlights include telemetry, programmability, security and
traffic management, and network time synchronization.

This will be the first of a set of write-ups exploring the hard- and software functionality of this
new vendor. As we'll see, it's all about the _software_.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}

The switch comes well packaged with two removable 400W Gold power supplies from _Compuware
Technology_ which output 12V/33A and +5V/3A, as well as four removable PWM controlled fans from
_Protechnic_. The fans expel air, so this unit cools front-to-back. Changing them to pull air
back-to-front would be possible after-sale, by flipping the fans around: each is attached in its
cage by two M4 flat-head screws. This is truly meant to be an OEM switch -- there is no logo or
sticker with the vendor's name, so I should probably print a few vinyl IPng stickers to skin them
later.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-daughterboard.png" alt="S5648 Daughterboard" >}}

On the front, the switch sports an RJ45 standard serial console, a mini-USB connector whose
function is not clear to me, an RJ45 network port used for management, a pinhole which houses a
reset button labeled `RST`, and two LED indicators labeled `ID` and `SYS`. The serial port runs at
115200,8n1 and the management network port is Gigabit.

Regarding the regular switch ports, there are 48x SFP+ cages, 4x QSFP28 ports (49-52) running at
100Gbit, and 2x QSFP+ ports (53-54) running at 40Gbit. All ports (management and switch) present a
MAC address from OUI `00-1E-08`, which is assigned to Centec.

The switch is not particularly quiet: its six fans in total start up at a high pitch, but once the
switch boots, they calm down and emit noise levels as you would expect from a datacenter unit. I
measured it at 74dBA when booting, and otherwise at around 62dBA when running. On the inside, the
PCB is rather clean. It comes with a daughterboard housing a small PowerPC P1010 with a 533MHz
CPU, 1GB of RAM, and 2GB of flash on board, which is running Linux. This is the same card that many
of the FS.com switches use (eg. the S5860-48S6Q), a cheaper alternative to the high end Intel
Xeon-D.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-switchchip-block.png" alt="S5648 Switchchip" >}}

#### S5648X (48x10, 2x40, 4x100)

There is one switch chip, on the front of the PCB, connecting all 54 ports. It has a sizable
heatsink on it, drawing air backwards through ports 36-48. The switch uses a lesser-known and
somewhat dated Centec [[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)],
codenamed GoldenGate and released in 2015, which is rated for 1.2Tbps of aggregate throughput. The
chip can be programmed to handle a bunch of SDN protocols, including VxLAN, GRE, GENEVE, and MPLS /
MPLS SR, with a limited TCAM to hold things like ACLs, IPv4/IPv6 routes and MPLS labels. The
CTC8096 provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports. The SerDes
design is pretty flexible, allowing it to mix and match ports.

You can see more (hi-res) pictures and screenshots throughout these articles in this [[Photo
Album](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)].

#### S5624X (24x10, 2x100)

In case you're curious (I certainly was!) the smaller unit (with 24x10+2x100) is built off of the
Centec [[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)], codenamed
TsingMa, which was released in 2019, and it offers a variety of similar features, including L2, L3,
MPLS, VxLAN, MPLS SR, and OAM/APS. Highlights include telemetry, programmability, security and
traffic management, and network time synchronization. The SoC has an embedded ARM A53 CPU core
running at 800MHz, and the SerDes on this chip allows for 24x1G/2.5G/5G/10G and 2x40G/100G, for a
throughput of 440Gbps at a fairly sharp price point.

One thing worth noting (because I know some of my regular readers will already be wondering!): this
series of chips (both the fourth generation CTC8096 and the sixth generation CTC7132) comes with a
very modest TCAM, which means in practice 32K MAC addresses, 8K IPv4 routes, 1K IPv6 routes, a 6K
MPLS table, 1K L2VPN instances, and 64 VPLS instances. The Centec also comes with a modest 32MB
packet buffer shared between all ports, and the controlplane comes with 1GB of memory and a 533MHz
ARM. So no, this won't take a full table :-) but in all honesty, that's not the thing this machine
is built to do.

When booted, the switch draws roughly 68 Watts combined on its two power supplies, and I find that
pretty cool considering the total throughput offered. Of course, once optics are inserted, the
total power draw will go up. Also worth noting: when the switch is under load, the Centec chip will
consume more power. For example, when forwarding 8x10G + 2x100G, the total consumption was 88
Watts, totally respectable now that datacenter power bills are skyrocketing.

### Topology

{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}

On the heels of my [[DENOG14 Talk](/media/denog14/index.html)], in which I showed how VPP can
_route_ 150Mpps and 180Gbps on a 10 year old Dell while consuming a full 1M BGP table in 7 seconds
or so, I still had a little bit of a LAB left to repurpose. So I built the following topology using
the loadtester, packet analyzer, and switches:

* **msw-top**: S5624-2Z-EI switch
* **msw-core**: S5648X-2Q4ZA switch
* **msw-bottom**: S5624-2Z-EI switch
* All switches connect to:
  * each other with 100G DACs (right, black)
  * T-Rex machine with 4x10G (left, rainbow)
* Each switch gets a mgmt IPv4 and IPv6

With this topology I will have enough wiggle room to patch anything to anything. Now that the
physical part is out of the way, let's take a look at the firmware of these things!

### Software

As can be seen in the topology above, I am testing three of these switches - two are the smaller
sibling [S5624X 2Z-EI] (which comes with 24x10G SFP+ and 2x100G QSFP28), and one is this [S5648X
2Q4Z] pictured above. The vendor has a licensing system with tiers for basic L2, basic L3 and
advanced metro L3. These switches come with the most advanced/liberal licenses, which means all of
the features will work on the switches, notably MPLS/LDP and VPWS/VPLS.

Taking a look at the CLI, it's very Cisco IOS-esque; there are a few small differences, but the
look and feel is definitely familiar. Base configuration kind of looks like this:

#### Basic config

```
msw-core# show running-config
management ip address 192.168.1.33/24
management route add gateway 192.168.1.252
!
ntp server 216.239.35.4
ntp server 216.239.35.8
ntp mgmt-if enable
!
snmp-server enable
snmp-server system-contact noc@ipng.ch
snmp-server system-location Bruttisellen, Switzerland
snmp-server community public read-only
snmp-server version v2c

msw-core# conf t
msw-core(config)# stm prefer ipran
```

A few small things of note. There is no `mgmt0` device as I would've expected. Instead, the SoC
exposes its management interface to be configured with these `management ...` commands. The IPv4
address can be either DHCP or static, and IPv6 can only do static addresses. Only one (default)
gateway can be set for either protocol. Then, NTP can be set up to work over the `mgmt-if`, which
is a useful way to use it for timekeeping.

The SNMP server works both from the `mgmt-if` and from the dataplane, which is nice.
SNMP supports everything you'd expect, including v3 and traps for all sorts of events, including
IPv6 targets and either dataplane or `mgmt-if`.

I did notice that the nameserver cannot use the `mgmt-if`, so I left it unconfigured. I found that
a little bit odd, considering all the other functionality does work just fine over the `mgmt-if`.

If you've run CAM-based systems before, you'll likely have come across some form of _partitioning_
mechanism, to allow certain types in the CAM (eg. IPv4, IPv6, L2 MACs, MPLS labels, ACLs) to have
more or fewer entries. This is particularly relevant on this switch because it has a comparatively
small CAM. It turns out that by default MPLS is entirely disabled, and to turn it on (and sacrifice
some of that sweet, sweet content addressable memory), I have to issue the command `stm prefer
ipran` (other flavors are _ipv6_, _layer3_, _ptn_, and _default_), and reload the switch.

Having been in the networking industry for a while, I scratched my head on the acronym **IPRAN**,
so I will admit having to look it up. It's a general term used to describe an IP based Radio Access
Network (2G, 3G, 4G or 5G) which uses IP as a transport layer technology. I find it funny, in a
twisted sort of way, that to get the oldskool MPLS service, I have to turn on IPRAN.

Anyway, after changing the STM profile to _ipran_, the following partition is available:

**IPRAN** CAM    | S5648X (msw-core)          | S5624 (msw-top & msw-bottom)
---------------- | -------------------------- | -----------------------------
MAC Addresses    | 32k                        | 98k
IPv4 routes      | host: 4k, indirect: 8k     | host: 12k, indirect: 56k
IPv6 routes      | host: 512, indirect: 512   | host: 2048, indirect: 1024
MPLS labels      | 6656                       | 6144
VPWS instances   | 1024                       | 1024
VPLS instances   | 64                         | 64
Port ACL entries | ingress: 1927, egress: 176 | ingress: 2976, egress: 928
VLAN ACL entries | ingress: 256, egress: 32   | ingress: 256, egress: 64

First off: there are quite a few differences here! The big switch has relatively few MAC addresses
and IPv4/IPv6 routes, compared to the little ones. But, it has a few more MPLS labels. ACL wise,
the small switch once again has a bit more capacity. But, of course, the large switch has lots more
ports (56 versus 26), and is more expensive. Choose wisely :)

Regarding IPv4/IPv6 and MPLS space, luckily [[AS8298]({{< ref "2021-02-27-network" >}})] is
relatively compact in its IGP. As of today, it carries 41 IPv4 and 48 IPv6 prefixes in OSPF, which
means that these switches would be fine participating in Area 0. If CAM space does turn into an
issue down the line, I can put them in stub areas and advertise only a default. As an aside, VPP
doesn't have any CAM at all, so for my routers the size is basically governed by system memory
(which on modern computers equals "infinite routes"). As long as I keep it out of the DFZ, this
switch should be fine, for example in a BGP-free core that switches traffic based on VxLAN or MPLS,
but I digress.

#### L2

First let's test a straightforward configuration:

```
msw-top# configure terminal
msw-top(config)# vlan database
msw-top(config-vlan)# vlan 5-8
msw-top(config-vlan)# interface eth-0-1
msw-top(config-if)# switchport access vlan 5
msw-top(config-vlan)# interface eth-0-2
msw-top(config-if)# switchport access vlan 6
msw-top(config-vlan)# interface eth-0-3
msw-top(config-if)# switchport access vlan 7
msw-top(config-vlan)# interface eth-0-4
msw-top(config-if)# switchport mode dot1q-tunnel
msw-top(config-if)# switchport dot1q-tunnel native vlan 8
msw-top(config-vlan)# interface eth-0-26
msw-top(config-if)# switchport mode trunk
msw-top(config-if)# switchport trunk allowed vlan only 5-8
```

By means of demonstration, I created port `eth-0-4` as a QinQ capable port - which means that any
untagged frames coming into it will become VLAN 8, and any tagged frames will become s-tag 8 with
the c-tag being whatever tag was sent; in other words, standard issue QinQ tunneling. The
configuration of `msw-bottom` is exactly the same, and because we're connecting these VLANs through
`msw-core`, I'll have to make it a member of all these interfaces using the `interface range`
shortcut:

```
msw-core# configure terminal
msw-core(config)# vlan database
msw-core(config-vlan)# vlan 5-8
msw-core(config-vlan)# interface range eth-0-49 - 50
msw-core(config-if)# switchport mode trunk
msw-core(config-if)# switchport trunk allowed vlan only 5-8
```

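As an aside, the double-tagging that `dot1q-tunnel` performs on `eth-0-4` can be sketched in a few
lines of Python. This is purely illustrative, and it assumes a TPID of 0x8100 for both the s-tag
and the c-tag (some platforms use the 802.1ad TPID 0x88a8 for the outer tag instead):

```python
import struct

TPID = 0x8100  # assumed TPID for both s-tag and c-tag in this sketch

def push_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert one 802.1Q tag right after the destination and source MACs."""
    tci = (pcp << 13) | (vid & 0x0FFF)
    return frame[:12] + struct.pack("!HH", TPID, tci) + frame[12:]

# A minimal untagged frame: dst MAC, src MAC, IPv4 ethertype, dummy payload.
untagged = bytes(6) + bytes(6) + struct.pack("!H", 0x0800) + bytes(46)

# A customer frame arriving with c-tag 42 gets s-tag 8 pushed on top, which
# is what 'switchport dot1q-tunnel native vlan 8' does for tagged ingress;
# untagged ingress would simply become VLAN 8.
qinq = push_tag(push_tag(untagged, 42), 8)

s_vid = struct.unpack("!H", qinq[14:16])[0] & 0x0FFF  # outer (service) tag
c_vid = struct.unpack("!H", qinq[18:20])[0] & 0x0FFF  # inner (customer) tag
print(s_vid, c_vid)  # → 8 42
```
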
The loadtest results in T-Rex are, quite unsurprisingly, line rate. In the screenshot below, I'm
sending 128 byte frames at 8x10G (40G from `msw-top` through `msw-core` and out `msw-bottom`, and
40G in the other direction):

{{< image src="/assets/oem-switch/l2-trex.png" alt="L2 T-Rex" >}}

A few notes, for critical observers:
* I have to use 128 byte frames because the T-Rex loadtester is armed with 3x Intel x710 NICs,
  which have a total packet rate of only 40Mpps. Intel made these with LACP redundancy in mind,
  and does not recommend fully loading them. As 64b frames would be ~59.52Mpps, the NIC won't keep
  up. So, I let T-Rex send 128b frames, which is ~33.8Mpps.
* T-Rex shows only the first 4 ports in detail, and you can see all four ports are sending 10Gbps
  of L1 traffic, which at this frame size is 8.66Gbps of ethernet (as each frame also has a
  24 byte overhead [[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). We can clearly see,
  though, that all Tx packets/sec are also Rx packets/sec, which means all traffic is safely
  accounted for.
* In the top panel, you will see not 4x10, but **8x10Gbps and 67.62Mpps** of total throughput,
  with no traffic lost, and the loadtester CPU well within limits: 👍

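The frame-rate arithmetic in that first note is easy to double-check. Counting frame size the way
T-Rex does (including the FCS), each frame on the wire additionally costs 20 bytes of preamble,
start-of-frame delimiter and inter-frame gap:

```python
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets/sec at line rate; +20B is preamble+SFD (8B) plus inter-frame gap (12B)."""
    return link_bps / ((frame_bytes + 20) * 8)

# 64-byte frames on 4x10G would need more than the x710s can deliver ...
print(round(4 * line_rate_pps(10e9, 64) / 1e6, 2))    # → 59.52 (Mpps)

# ... while 128-byte frames keep the loadtester well within its budget.
print(round(4 * line_rate_pps(10e9, 128) / 1e6, 2))   # → 33.78 (Mpps)
```
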
```
msw-top# show int summary | exc DOWN
RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)

Interface     Link        RXBS       RXPS         TXBS       TXPS
-----------------------------------------------------------------------------
eth-0-1       UP   10016060422    8459510  10016060652    8459510
eth-0-2       UP   10016080176    8459527  10016079835    8459526
eth-0-3       UP   10015294254    8458863  10015294258    8458863
eth-0-4       UP   10016083019    8459529  10016083126    8459529
eth-0-25      UP           449          0          501          0
eth-0-26      UP   41362394687   33837608  41362394527   33837608
```

Clearly, all three switches are happy to forward 40Gbps in both directions, and the 100G port is
happy to forward (at least) 40G symmetric - and because the uplink port is trunked, each ethernet
frame will be 4 bytes longer due to the dot1q tag, which, at 128b frames, means we'll be using
132/128 * 4 * 10G == 41.3G of traffic, which is spot on.

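That back-of-the-envelope number is worth making explicit. Every 128-byte access frame grows by
one 4-byte dot1q tag on the trunk, so:

```python
frame, tag = 128, 4   # T-Rex frame size, and one 802.1Q tag added on the trunk
access_gbps = 4 * 10  # four access ports' worth of traffic, per direction

trunk_gbps = access_gbps * (frame + tag) / frame
print(round(trunk_gbps, 2))  # → 41.25, matching the ~41.3G observed on eth-0-26
```
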
#### L3

For this test, I will reconfigure the 100G ports to become routed rather than switched. Remember,
`msw-top` connects to `msw-core`, which in turn connects to `msw-bottom`, so I'll need two IPv4 /31
and two IPv6 /64 transit networks. I'll also create a loopback interface with a stable IPv4 and
IPv6 address on each switch, and I'll tie all of these together in IPv4 and IPv6 OSPF in Area 0.
The configuration for the `msw-top` switch becomes:

```
msw-top# configure terminal
interface loopback0
 ip address 172.20.0.2/32
 ipv6 address 2001:678:d78:400::2/128
 ipv6 router ospf 8298 area 0
!
interface eth-0-26
 description Core: msw-core eth-0-49
 speed 100G
 no switchport
 mtu 9216
 ip address 172.20.0.11/31
 ipv6 address 2001:678:d78:400::2:2/112
 ip ospf network point-to-point
 ip ospf cost 1004
 ipv6 ospf network point-to-point
 ipv6 ospf cost 1006
 ipv6 router ospf 8298 area 0
!
router ospf 8298
 router-id 172.20.0.2
 network 172.20.0.0/22 area 0
 redistribute static
!
router ipv6 ospf 8298
 router-id 172.20.0.2
 redistribute static
```

Now that the IGP is up for IPv4 and IPv6 and I can ping the loopbacks from any switch to any other
switch, I can continue with the loadtest. I'll configure four IPv4 interfaces:

```
msw-top# configure terminal
interface eth-0-1
 no switchport
 ip address 100.65.1.1/30
!
interface eth-0-2
 no switchport
 ip address 100.65.2.1/30
!
interface eth-0-3
 no switchport
 ip address 100.65.3.1/30
!
interface eth-0-4
 no switchport
 ip address 100.65.4.1/30
!
ip route 16.0.1.0/24 100.65.1.2
ip route 16.0.2.0/24 100.65.2.2
ip route 16.0.3.0/24 100.65.3.2
ip route 16.0.4.0/24 100.65.4.2
```

After which I can see these transit networks and static routes propagate, through `msw-core`, and
into `msw-bottom`:

```
msw-bottom# show ip route
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       Dc - DHCP Client
       [*] - [AD/Metric]
       * - candidate default

O   16.0.1.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.2.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.3.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   16.0.4.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.1.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.2.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.3.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O   100.65.4.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C   172.20.0.0/32 is directly connected, loopback0
O   172.20.0.1/32 [110/1005] via 172.20.0.9, eth-0-26, 05:50:48
O   172.20.0.2/32 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C   172.20.0.8/31 is directly connected, eth-0-26
C   172.20.0.8/32 is in local loopback, eth-0-26
O   172.20.0.10/31 [110/1018] via 172.20.0.9, eth-0-26, 05:50:48
```

I now instruct the T-Rex loadtester to send single-flow loadtest traffic from 16.0.X.1 -> 48.0.X.1
on port 0, and back from 48.0.X.1 -> 16.0.X.1 on port 1; ports 2+3 use X=2, ports 4+5 use X=3, and
ports 6+7 use X=4. After T-Rex starts up, it's sending 80Gbps of traffic with a grand total of
67.6Mpps in 8 unique flows of 8.45Mpps at 128b each, and the three switches forward this L3 IPv4
unicast traffic effortlessly:

{{< image src="/assets/oem-switch/l3-trex.png" alt="L3 T-Rex" >}}

#### Overlay

What I've built just now would be acceptable really only if the switches were in the same rack (or
at best, facility). As an industry professional, I frown upon things like _VLAN-stretching_, a term
that describes bridging VLANs between buildings (or, as some might admit to .. between cities or
even countries 🤮). A long time ago (in December 1999), Luca Martini invented what are now called
[[Martini Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)],
defining how to transport Ethernet frames over an MPLS network, which is what I really want to
demonstrate, albeit in the next article.

What folks don't always realize is that the industry is _moving on_ from MPLS to a set of more
flexible IP based solutions, notably tunneling using IPv4 or IPv6 UDP packets such as found in
VxLAN or GENEVE, two of my favorite protocols. This certainly does cost a little bit in VPP, as I
wrote about in my post on [[VLLs in VPP]({{< ref "2022-01-12-vpp-l2" >}})], although you'd be
surprised how many VxLAN encapsulated packets/sec a simple AMD64 router can forward. With respect
to these switches, though, let's find out if tunneling this way incurs an overhead or performance
penalty. Ready? Let's go!

First I will put the first four interfaces in range `eth-0-1 - 4` into a new set of VLANs, but in
the VLAN database I will enable what is called _overlay_ on them:

```
msw-top# configure terminal
vlan database
 vlan 5-8,10,20,30,40
 vlan 10 name v-vxlan-xco10
 vlan 10 overlay enable
 vlan 20 name v-vxlan-xco20
 vlan 20 overlay enable
 vlan 30 name v-vxlan-xco30
 vlan 30 overlay enable
 vlan 40 name v-vxlan-xco40
 vlan 40 overlay enable
!
interface eth-0-1
 switchport access vlan 10
!
interface eth-0-2
 switchport access vlan 20
!
interface eth-0-3
 switchport access vlan 30
!
interface eth-0-4
 switchport access vlan 40
```

Next, I create two new loopback interfaces (bear with me on this one), and configure the transport
of these overlays in the switch. This configuration will pick up the VLANs and move them to remote
sites in either the VxLAN, GENEVE or NvGRE protocol, like this:

```
msw-top# configure terminal
!
interface loopback1
 ip address 172.20.1.2/32
!
interface loopback2
 ip address 172.20.2.2/32
!
overlay
 remote-vtep 1 ip-address 172.20.0.0 type vxlan src-ip 172.20.0.2
 remote-vtep 2 ip-address 172.20.1.0 type nvgre src-ip 172.20.1.2
 remote-vtep 3 ip-address 172.20.2.0 type geneve src-ip 172.20.2.2 keep-vlan-tag
 vlan 10 vni 829810
 vlan 10 remote-vtep 1
 vlan 20 vni 829820
 vlan 20 remote-vtep 2
 vlan 30 vni 829830
 vlan 30 remote-vtep 3
 vlan 40 vni 829840
 vlan 40 remote-vtep 1
!
```

Alright, this is seriously cool! The first overlay defines what is called a remote `VTEP` (virtual
tunnel end point), of type `VxLAN`, towards IPv4 address `172.20.0.0`, coming from source address
`172.20.0.2` (which is our `loopback0` interface on switch `msw-top`). As it turns out, I am not
allowed to create different overlay _types_ to the same _destination_ address, but not to worry: I
can create a few unique loopback interfaces with unique IPv4 addresses (see `loopback1` and
`loopback2`) and create new VTEPs using these. So, the VTEP at index 2 is of type `NvGRE` and the
one at index 3 is of type `GENEVE`, and due to the use of `keep-vlan-tag`, the encapsulated traffic
will carry dot1q tags, whereas in the other two VTEPs the tag will be stripped and what is
transported on the wire is untagged traffic.

```
msw-top# show vlan all
VLAN ID  Name                            State    STP ID   Member ports
                                                           (u)-Untagged, (t)-Tagged
=======  ==============================  =======  =======  ========================
(...)
10       v-vxlan-xco10                   ACTIVE   0        eth-0-1(u)
                                                           VxLAN: 172.20.0.2->172.20.0.0
20       v-vxlan-xco20                   ACTIVE   0        eth-0-2(u)
                                                           NvGRE: 172.20.1.2->172.20.1.0
30       v-vxlan-xco30                   ACTIVE   0        eth-0-3(u)
                                                           GENEVE: 172.20.2.2->172.20.2.0
40       v-vxlan-xco40                   ACTIVE   0        eth-0-4(u)
                                                           VxLAN: 172.20.0.2->172.20.0.0

msw-top# show mac address-table
          Mac Address Table
-------------------------------------------
(*) - Security Entry        (M) - MLAG Entry
(MO) - MLAG Output Entry    (MI) - MLAG Input Entry
(E) - EVPN Entry            (EO) - EVPN Output Entry
(EI) - EVPN Input Entry

Vlan    Mac Address       Type      Ports
----    -----------       --------  -----
10      6805.ca32.4595    dynamic   VxLAN: 172.20.0.2->172.20.0.0
10      6805.ca32.4594    dynamic   eth-0-1
20      6805.ca32.4596    dynamic   NvGRE: 172.20.1.2->172.20.1.0
20      6805.ca32.4597    dynamic   eth-0-2
30      9c69.b461.7679    dynamic   GENEVE: 172.20.2.2->172.20.2.0
30      9c69.b461.7678    dynamic   eth-0-3
40      9c69.b461.767a    dynamic   VxLAN: 172.20.0.2->172.20.0.0
40      9c69.b461.767b    dynamic   eth-0-4
```

Turning my attention to the VLAN database, I can now see the power of this become obvious. This
switch can have any number of local interfaces in a VLAN, either tagged or untagged (in the case of
VLAN 10 we can see `eth-0-1(u)`, which means that interface is participating in the VLAN untagged),
but we can also see that this VLAN 10 has a member port called `VxLAN: 172.20.0.2->172.20.0.0`.
This port is just like any other, in that it'll participate in unknown unicast, broadcast and
multicast, and "learn" MAC addresses behind these virtual overlay ports. In VLAN 10 (and VLAN 40),
I can see in the L2 FIB (`show mac address-table`) that there's a local MAC address learned (from
the T-Rex loadtester) behind `eth-0-1`, but there's also a remote MAC address learned behind the
VxLAN port. I'm impressed.

I can add any number of VLANs (and dot1q-tunnels) into a VTEP endpoint, after assigning each of
them a unique `VNI` (virtual network identifier). If you're curious about these, take a look at the
[[VxLAN](https://datatracker.ietf.org/doc/html/rfc7348)], [[GENEVE](https://datatracker.ietf.org/doc/html/rfc8926)]
and [[NvGRE](https://www.rfc-editor.org/rfc/rfc7637.html)] specifications. Basically, the
encapsulation just puts the ethernet frame as the payload of a UDP packet (or, in the case of
NvGRE, a GRE packet), so let's take a look at those.

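For the curious, the VxLAN header that carries these VNIs is only 8 bytes. A minimal sketch of its
RFC 7348 layout (one flags byte with the I-bit set, a 24-bit VNI, and the rest reserved), using
VLAN 10's VNI from the configuration above:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VxLAN header: I-flag set, 24-bit VNI, reserved bits zero."""
    return struct.pack("!II", 0x08 << 24, (vni & 0xFFFFFF) << 8)

hdr = vxlan_header(829810)  # the VNI assigned to VLAN 10
print(hdr.hex())            # → 080000000ca97200

# tcpdump prints this value back in decimal ('vni 829810'); for GENEVE it
# prints hex, which is why 'vni 0xca986' below is really 829830.
print(struct.unpack("!I", hdr[4:8])[0] >> 8)  # → 829810
```
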
#### Inspecting overlay

As you'll recall, the VLAN 10, 20, 30, 40 traffic is now traveling over an IP network, notably
encapsulated by the source switch `msw-top` and delivered to `msw-bottom` via the IGP (in my case,
OSPF), while it transits through `msw-core`. I decide to take a look at this, by configuring a
monitor port on `msw-core`:

```
msw-core# show run | inc moni
monitor session 1 source interface eth-0-49 both
monitor session 1 destination interface eth-0-1
```

This will copy all in- and egress traffic from interface `eth-0-49` (connected to `msw-top`)
through to local interface `eth-0-1`, which is connected to the loadtester. I can simply tcpdump
this stuff:

```
pim@trex01:~$ sudo tcpdump -ni eno2 '(proto gre) or (udp and port 4789) or (udp and port 6081)'
01:26:24.685666 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
    (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
    172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829810
  68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.1.47.1025 > 16.0.1.47.12: UDP, length 82
01:26:24.688305 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype IPv4 (0x0800), length 166:
    (tos 0x0, ttl 128, id 44814, offset 0, flags [DF], proto GRE (47), length 152)
    172.20.1.2 > 172.20.1.0: GREv0, Flags [key present], key=0xca97c38, proto TEB (0x6558), length 132
  68:05:ca:32:45:97 > 68:05:ca:32:45:96, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.2.73.1025 > 16.0.2.73.12: UDP, length 82
01:26:24.689100 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 178:
    (tos 0x0, ttl 127, id 7502, offset 0, flags [DF], proto UDP (17), length 164)
    172.20.2.0.49208 > 172.20.2.2.6081: GENEVE, Flags [none], vni 0xca986, proto TEB (0x6558)
  9c:69:b4:61:76:79 > 9c:69:b4:61:76:78, ethertype 802.1Q (0x8100), length 128: vlan 30, p 0, ethertype IPv4 (0x0800),
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.3.109.1025 > 16.0.3.109.12: UDP, length 82
01:26:24.701666 00:1e:08:0d:6e:89 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
    (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
    172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829840
  68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
    (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
    48.0.4.47.1025 > 16.0.4.47.12: UDP, length 82
```

We can see packets for all four tunnels in this dump. The first one is a UDP packet to port 4789,
which is the standard port for VxLAN, and it has VNI 829810. The second packet is proto GRE with
flag `TEB`, which stands for _transparent ethernet bridging_ -- in other words, an L2 variant of
GRE that carries ethernet frames. The third one shows that feature I configured above (in case you
forgot, it's the `keep-vlan-tag` option when creating the VTEP), and because of that flag we can
see that the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986`, which is
hex for `829830`. Finally, the fourth one shows VLAN 40 traffic that is sent to the same VTEP
endpoint as VLAN 10 traffic (showing that multiple VLANs can be transported across the same tunnel,
distinguished by VNI).

{{< image width="90px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

At this point I make an important observation. VxLAN and GENEVE both have this really cool feature:
they can hash their _inner_ payload (ie. the IPv4/IPv6 addresses and ports, if available) and use
that to randomize the _source port_, which makes them preferable to GRE. The reason this is
preferable is that hashing makes unique inner flows become unique outer flows, which in turn allows
them to be loadbalanced in intermediate networks, but also in the receiver if it has multiple
receive queues. However (and this is important!), the switch does not hash, which means that all
ethernet traffic in the VxLAN, GENEVE and NvGRE tunnels always has the exact same outer header, so
loadbalancing and multiple receive queues are out of the question. I wonder if this is a limitation
of the Centec chip, or a failure to program or configure it by the firmware.

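For illustration, what a hashing VTEP typically does is derive the outer UDP source port from the
inner headers, so that each inner flow maps to a stable but distinct outer flow. The exact hash is
implementation specific; this hypothetical sketch just uses CRC32 over the inner 5-tuple:

```python
import zlib

def entropy_src_port(src_ip: str, dst_ip: str, proto: int,
                     sport: int, dport: int) -> int:
    """Map an inner 5-tuple onto the dynamic port range 49152-65535, as
    RFC 7348 suggests, so ECMP and NIC RSS can spread the tunneled traffic."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return 49152 + (zlib.crc32(key) % 16384)

# The same inner flow always yields the same outer port (no reordering),
# while different inner flows very likely land on different outer ports.
p1 = entropy_src_port("48.0.1.47", "16.0.1.47", 17, 1025, 12)
p2 = entropy_src_port("48.0.2.73", "16.0.2.73", 17, 1025, 12)
print(p1, p2)
```
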
With that gripe out of the way, let's take a look at 80Gbit of tunneled traffic, shall we?

{{< image src="/assets/oem-switch/overlay-trex.png" alt="Overlay T-Rex" >}}

Once again, all three switches are acing it. So at least 40Gbps of encap- and 40Gbps of
decapsulation per switch, and the transport over IPv4 through the `msw-core` switch to the other
side, is all in working order. On top of that, I've shown that multiple types of overlay can live
alongside one another, even between the same pair of switches, and that multiple VLANs can share
the same underlay transport. The only downside is the **single flow** nature of these UDP
transports.

A final inspection of the switch throughput:

```
msw-top# show interface summary | exc DOWN
RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)

Interface     Link        RXBS       RXPS         TXBS       TXPS
-----------------------------------------------------------------------------
eth-0-1       UP   10013004482    8456929  10013004548    8456929
eth-0-2       UP   10013030687    8456951  10013030801    8456951
eth-0-3       UP   10012625863    8456609  10012626030    8456609
eth-0-4       UP   10013032737    8456953  10013034423    8456954
eth-0-25      UP           505          0          513          0
eth-0-26      UP   51147539721   33827761  51147540111   33827762
```

Take a look at that `eth-0-26` interface: it's using significantly more bandwidth (51Gbps) than the
sum of the four transports (4x10Gbps). This is because each ethernet frame (of 128b) has to be
wrapped in an IPv4 UDP packet (or, in the case of NvGRE, an IPv4 packet with a GRE header), which
incurs quite some overhead, for small packets at least. But it definitely proves that the switches
here are happy to do this forwarding at line rate, and that's what counts!

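Dividing the trunk's bit rate by its packet rate makes that overhead visible; this is just
arithmetic on the two `eth-0-26` counters above:

```python
rxbs = 51147539721   # eth-0-26 rx bits/sec from the summary above
rxps = 33827761      # eth-0-26 rx packets/sec

avg_bytes = rxbs / 8 / rxps
print(round(avg_bytes))        # → 189 bytes per packet on the trunk
print(round(avg_bytes - 128))  # → 61 bytes of average overhead per frame
```

Roughly 61 bytes of extra framing per 128-byte inner frame: on the order of an outer MAC + IPv4 +
UDP + VxLAN wrapper (14+20+8+8 = 50 bytes), with the remainder down to per-tunnel differences (GRE
keys, GENEVE's retained dot1q tag) and how the counters account for framing.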
### Conclusions

It's just super cool to see a switch like this work as expected. I did not manage to overload it at
all, neither with an IPv4 loadtest at 67Mpps and 80Gbit of traffic, nor with an L2 loadtest with
four ports transported over VxLAN, NvGRE and GENEVE _at the same time_. Although the underlay can
only use IPv4 (no IPv6 is available in the switch chip), this is not a huge problem for me. At
AS8298, I can easily define some private VRF with IPv4 space from RFC1918 to do the transport of
traffic over VxLAN. And what's even better, this can perfectly inter-operate with my VPP routers,
which also do VxLAN en/decapsulation.

Now there is one more thing for me to test (and, cliffhanger, I've tested it already but I'll have
to write up all of my data and results ...). I need to do what I said I would do in the beginning
of this article, and what I had hoped to achieve with the FS switches but failed to, due to lack of
support: MPLS L2VPN transport (and its more complex but cooler sibling, VPLS).