---
date: "2021-08-07T06:17:54Z"
title: 'Review: FS S5860-20SQ Switch'
---

[FiberStore](https://fs.com/) is a staple provider of optics and network gear in Europe. Although
I've been buying optics like SFP+ and QSFP+ from them for years, I rarely looked at the switch
hardware they have on sale, until my buddy Arend suggested one of their switches as a good
alternative for an Internet Exchange Point, one with [Frysian roots](https://frys-ix.net/) no less!

# Executive Summary

{{< image width="400px" float="left" src="/assets/fs-switch/fs-switches.png" alt="Switches" >}}

The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SFP28 ports
and 2x 40G QSFP+ ports, each of which can also be reconfigured as 4x10G. The switch has a
Cisco-like CLI and great performance. I loadtested a pair of them in L2, QinQ, and L3 mode,
and they handled all the packets I sent to and through them, with the 10G, 25G and 40G ports
all in use. Considering the redundant power supplies, the relatively low power draw, and the
silicon-based switching of both L2 and L3, I definitely appreciate the price/performance.
The switch would be an even better match if it offered MPLS-based L2VPN services, but it
does not support those.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/fs-switch/fs-switch-inside.png" alt="Inside" >}}

The switch is based on Broadcom's [BCM56170](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56170), codename *Hurricane*, with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I
noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom
has taken the per-port ingress capacity, while FS.com is taking the ingress/egress port
capacity and summing them up. Further, the sales pitch claims 565Mpps, which I found curious: if
we divide the available bandwidth of 380Gbps (the number from the Broadcom datasheet) by the smallest
possible frame of 84 bytes (672 bits), we get 565Mpps. Why FS.com decided to seemingly
arbitrarily double the switching capacity while reporting the nominal forwarding rate is
beyond me.

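To double-check that arithmetic, here's a tiny Python sketch (my own, just reproducing the numbers above):

```python
# Worked example: Broadcom datasheet bandwidth vs. advertised forwarding rate.
PORTS_10G = 28
PORTS_25G = 4
bandwidth_gbps = PORTS_10G * 10 + PORTS_25G * 25   # 380 Gbps per the BCM56170 datasheet

# Smallest frame on the wire: 64B frame + 8B preamble/SFD + 12B inter-packet gap = 84B = 672 bits.
min_frame_bits = 84 * 8

mpps = bandwidth_gbps * 1e9 / min_frame_bits / 1e6
print(f"{bandwidth_gbps} Gbps / {min_frame_bits} bits = {mpps:.0f} Mpps")   # ~565 Mpps
print(f"FS.com marketing: {bandwidth_gbps * 2} Gbps, {mpps:.0f} Mpps")      # capacity doubled, pps not
```
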
You can see more (hi-res) pictures in this [Photo Album](https://photos.app.goo.gl/6M69UgowGf6yWJmU7).

This Broadcom chip is an SoC (System-on-a-Chip) which comes with an Arm A9 core and a modest
amount of TCAM on board, and packs into a 31x31mm *ball grid array* form factor. The switch
chip is able to store 16K routes and ACLs - it did not become immediately obvious to me what
the partitioning is (between IPv4 entries, IPv6 entries, and L2/L3/L4 ACL entries). One can only
assume that the total sum of TCAM-based objects must not exceed 4K entries. This means that
as a campus switch, the L3 functionality will be great, including with routing protocols such
as OSPF and IS-IS. However, BGP with any amount of routing table activity will not be a good
fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)

This Broadcom chip alone retails for €798,- apiece at [Digikey](https://www.digikey.ie/products/en?keywords=BCM56170B0IFSBG),
with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing
foundry and supply chain crisis, I don't know. But at that price point, the retail price of
€1150,- per switch is really attractive.

{{< image width="300px" float="right" src="/assets/fs-switch/noise.png" alt="Noise" >}}

The switch comes with two modular and field-replaceable power supplies (rated at 150W each,
delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable
fan trays installed with one fan each. Idle, without any optics installed and with all interfaces
down, the switch draws about 18W of power, which is nice. The fans spin up only when needed,
and by default the switch is quiet, but certainly not silent. I measured it after a tip from
Michael, certainly nothing scientific, but in a silent room that measures a noise floor of ~30 dBA,
the switch booted up and briefly burst the fans to 60dBA, after which it **stabilized at 54dBA**
or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed
directly towards the rear of the device, at 1 meter distance. Or something, IDK, I'm a network
engineer, Jim, not an audio specialist!

Besides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be
broken out into 4x 10G as well, bringing the TenGig port count to the datasheet-specified 28),
the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy for
firmware upgrades and for copying files such as SSH keys back and forth), an RJ45 1G management
port, which does not participate in the switch fabric at all, and an RJ45 serial port that uses a
standard Cisco cable for access and presents itself as `9600,8n1` to a console server, although
flow control must be disabled on the serial port.

#### Transceiver Compatibility

FS did not attempt any vendor lock-in or crippleware with the ports and optics, yaay for that.
I successfully inserted Cisco optics, Arista optics, FS.com 'Generic' optics, and several DACs
for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them. The switch,
as one would expect, supports digital diagnostics, which look like this:

```
fsw0#show interfaces TFGigabitEthernet0/24 transceiver
Transceiver Type              : 25GBASE-LR-SFP28
Connector Type                : LC
Wavelength(nm)                : 1310
Transfer Distance             :
  SMF fiber
    -- 10km
Digital Diagnostic Monitoring : YES
Vendor Serial Number          : G2006362849

Current diagnostic parameters[AP:Average Power]:
Temp(Celsius)  Voltage(V)  Bias(mA)   RX power(dBm)   TX power(dBm)
33(OK)         3.29(OK)    38.31(OK)  -0.10(OK)[AP]   -0.07(OK)

Transceiver current alarm information:
None
```

.. with a helpful shorthand `show interfaces ... trans diag` that only shows the optical budget.

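As an aside, the DDM power readings in dBm relate to milliwatts as 10·log10(mW); a quick Python helper (mine, not part of the switch) to go back and forth:

```python
import math

def mw_to_dbm(mw: float) -> float:
    """Convert optical power from milliwatts to dBm."""
    return 10 * math.log10(mw)

def dbm_to_mw(dbm: float) -> float:
    """Convert optical power from dBm back to milliwatts."""
    return 10 ** (dbm / 10)

# The RX reading of -0.10 dBm above is roughly 0.977 mW:
print(f"{dbm_to_mw(-0.10):.3f} mW")
```
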
### Software

I bought a pair of switches, and they came delivered with a current firmware version. The devices
identify themselves as `FS Campus Switch (S5860-20SQ) By FS.COM Inc` with a hardware version of `1.00`
and a software version of `S5860_FSOS 12.4(1)B0101P1`. Firmware updates can be downloaded from the
FS.com website directly. I'm not certain if there's a viable ONIE firmware for this chip, although the
N8050 certainly can run ONIE, Cumulus and its own ICOS, which is backed by Broadcom. Maybe
in the future I could take a better look at the open networking firmware aspects of this type of
hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to
a certain number of routes (I think 4K or 16K in the FIB, and only 1GB of RAM on the SOC), this is
not the right platform to pour energy into trying to get DANOS to run on.

Taking a look at the CLI, it's very Cisco IOS-esque; there are a few small differences, but the look
and feel is definitely familiar. The base configuration looks something like this:

```
fsw0#show running-config
hostname fsw0
!
sntp server oob 216.239.35.12
sntp enable
!
username pim privilege 15 secret 5 $1$<redacted>
!
ip name-server oob 8.8.8.8
!
service password-encryption
!
enable service ssh-server
no enable service telnet-server
!
interface Mgmt 0
 ip address 192.168.1.10 255.255.255.0
 gateway 192.168.1.1
!
snmp-server location Zurich, Switzerland
snmp-server contact noc@ipng.ch
snmp-server community 7 <redacted> ro
!
```

Configuration also follows the familiar `conf t` (configure terminal) flow that many of us grew up
with, and `show` commands allow for `include` and `exclude` modifiers, of course with all the
shortest-unique abbreviations such as `sh int | i Forty` and the like. VLANs are to be declared
up front, with one notable cool feature of `supervlans`, which are the equivalent of aggregating
VLANs together in the switch - a useful example might be an internet exchange platform which has
trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer,
but then all end up in the same peering LAN, VLAN 100.

A few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this
to work, the `oob` keyword has to be used. This is likely because the mgmt port is a network interface
that is attached to the SOC, not to the switch fabric itself, and thus its route is not added to
the routing table. I like this, because it avoids the mgmt network being picked up in OSPF and
accidentally routed to/from. But it does make for a slightly more awkward config:

```
fsw1#show running-config | inc oob
sntp server oob 216.239.35.12
ip name-server oob 8.8.8.8
ip name-server oob 1.1.1.1
ip name-server oob 9.9.9.9

fsw1#copy ?
  WORD            Copy origin file from native
  flash:          Copy origin file from flash: file system
  ftp:            Copy origin file from ftp: file system
  http:           Copy origin file from http: file system
  oob_ftp:        Copy origin file from oob_ftp: file system
  oob_http:       Copy origin file from oob_http: file system
  oob_tftp:       Copy origin file from oob_tftp: file system
  running-config  Copy origin file from running config
  startup-config  Copy origin file from startup config
  tftp:           Copy origin file from tftp: file system
  tmp:            Copy origin file from tmp: file system
  usb0:           Copy origin file from usb0: file system
```

Note here the hack of `oob_ftp:` and friends; this allows the switch to copy things from the
OOB (management) network by overriding the scheme. That's OK, I guess - not beautiful,
but it gets the job done, and these types of commands will rarely be used.

A few configuration examples, notably QinQ, in which I configure a port to take usual dot1q
traffic, say from a customer, and add it into our local VLAN 200. Untagged traffic
on that port will end up in our VLAN 200, and tagged traffic will end up in our dot1ad stack
of outer VLAN 200 and inner VLAN whatever the customer provided -- in this case allowing only
VLANs 1000-2000 and untagged traffic into VLAN 200:

```
fsw0#configure
fsw0(config)#vlan 200
fsw0(config-vlan)#name v-qinq-outer
fsw0(config-vlan)#exit
fsw0(config)#interface TenGigabitEthernet 0/3
fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
```

The industry remains conflicted about the outer ethernet frame's type -- originally a tag
protocol identifier (TPID) of 0x9100 was suggested, and that's what this switch uses by default. But
the formal specification of Q-in-Q, called 802.1ad, says that the outer TPID should be 0x88a8,
distinct from the 802.1Q VLAN TPID of 0x8100. This ugly reality can be reflected directly in the
switchport configuration by adding a `frame-tag tpid 0xXXXX` value to let the switch know
which TPID needs to be used for the outer tag.

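To make the byte layout concrete, here's a small Python sketch (mine, not switch output) that builds the first tag bytes of a QinQ frame with either TPID convention:

```python
import struct

def qinq_header(outer_vid: int, inner_vid: int, outer_tpid: int = 0x9100) -> bytes:
    """Return the 8 tag bytes that follow the dst/src MACs: outer tag, then inner 802.1Q tag."""
    outer_tci = outer_vid & 0x0FFF            # PCP/DEI left at 0 for simplicity
    inner_tci = inner_vid & 0x0FFF
    return struct.pack("!HHHH", outer_tpid, outer_tci, 0x8100, inner_tci)

# Outer VLAN 200, inner VLAN 1234, using the switch default TPID of 0x9100:
print(qinq_header(200, 1234).hex())                      # 910000c8810004d2
# The same frame as strict 802.1ad would use 0x88a8 for the outer tag:
print(qinq_header(200, 1234, outer_tpid=0x88a8).hex())   # 88a800c8810004d2
```
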
If this type of historical thing interests you, I definitely recommend reading up on
[802.1Q](https://en.wikipedia.org/wiki/IEEE_802.1Q) and [802.1ad](https://en.wikipedia.org/wiki/IEEE_802.1ad)
on Wikipedia as well.

## Loadtests

For my loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
with a custom Python controller that ramps traffic from the loadtester to the device
under test (DUT) up and down, by sending traffic out `port0` to the DUT and expecting that traffic to be
presented back from the DUT on its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). You can read a bit more about my setup in my [Loadtesting at Coloclue]({% post_url 2021-02-27-coloclue-loadtest %})
post.

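For flavor, a minimal stateless T-Rex client along these lines might look like the sketch below. This is an illustrative skeleton, not my actual controller; the server address, the single UDP stream, and the exact import path (`trex.stl.api` vs the older `trex_stl_lib.api`) all depend on your T-Rex install.

```python
# Minimal T-Rex stateless sketch: push continuous UDP traffic out port 0 and
# count what comes back on port 1. Illustrative only; adjust to your T-Rex version.
from trex.stl.api import STLClient, STLStream, STLPktBuilder, STLTXCont
from scapy.layers.inet import IP, UDP
from scapy.layers.l2 import Ether

c = STLClient(server="localhost")
c.connect()
c.reset(ports=[0, 1])

pkt = STLPktBuilder(pkt=Ether() / IP(src="16.0.0.1", dst="48.0.0.1") / UDP(dport=12) / (b"x" * 26))
c.add_streams(STLStream(packet=pkt, mode=STLTXCont()), ports=[0])

c.start(ports=[0], mult="99%", duration=60)   # a ramp controller would vary 'mult' between runs
c.wait_on_traffic(ports=[0])
stats = c.get_stats()
print("rx on port 1:", stats[1]["ipackets"])
c.disconnect()
```
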
To stress test the switch, several port pairs at 10G and 25G were used, and since the specs boast
line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found out,
once again, that Intel's X710 network cards aren't line rate, something I'll dive into in a bit
more detail another day; for now, take a look at the [T-Rex docs](https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_stateless_bench.asciidoc).

### L2

First, let's test a straightforward configuration. I connect a DAC between a 40G port on each
switch, connect a loadtester to ports `TenGigabitEthernet 0/1` and `TenGigabitEthernet 0/2`
on either switch, and leave everything simply in the default VLAN. This means packets from
Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and
out on Te0/1 and Te0/2 there, to return to the loadtester. Configuration-wise, rather boring:

```
fsw0#configure
fsw0(config)#vlan 1
fsw0(config-vlan)#name v-default

fsw0#show run int te0/1
interface TenGigabitEthernet 0/1

fsw0#show run int te0/2
interface TenGigabitEthernet 0/2

fsw0#show run int fo0/26
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1

fsw0#show vlan id 1
VLAN       Name                             Status    Ports
---------- -------------------------------- --------- -----------------------------------
1          v-default                        STATIC    Te0/1, Te0/2, Te0/3, Te0/4
                                                      Te0/5, Te0/6, Te0/7, Te0/8
                                                      Te0/9, Te0/10, Te0/11, Te0/12
                                                      Te0/13, Te0/14, Te0/15, Te0/16
                                                      Te0/17, Te0/18, Te0/19, Te0/20
                                                      TF0/21, TF0/22, TF0/23, Fo0/25
                                                      Fo0/26
```

I set up T-Rex with unique MAC addresses for each of its ports. I find it useful
to codify a few bits of information into the MAC, such as the loadtester machine,
PCI bus and port, so that when I try to find them in the switches' forwarding
tables while I have many loadtesters running at the same time, it's easier to
find what I'm looking for. My T-Rex configuration for this loadtest:
```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version         : 2
  interfaces      : ["42:00.0","42:00.1", "42:00.2", "42:00.3"]
  port_limit      : 4
  port_info       :
    - dest_mac    : [0x0,0x2,0x1,0x1,0x0,0x00]  # port 0
      src_mac     : [0x0,0x2,0x1,0x2,0x0,0x00]
    - dest_mac    : [0x0,0x2,0x1,0x2,0x0,0x00]  # port 1
      src_mac     : [0x0,0x2,0x1,0x1,0x0,0x00]
    - dest_mac    : [0x0,0x2,0x1,0x3,0x0,0x00]  # port 2
      src_mac     : [0x0,0x2,0x1,0x4,0x0,0x00]
    - dest_mac    : [0x0,0x2,0x1,0x4,0x0,0x00]  # port 3
      src_mac     : [0x0,0x2,0x1,0x3,0x0,0x00]
```

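A hypothetical helper in the same spirit (the field layout here is my own invention, not part of T-Rex) that packs a machine id, NIC index and port number into a locally administered MAC:

```python
def testbed_mac(machine: int, nic: int, port: int) -> list[int]:
    """Pack (machine, nic, port) into a locally administered MAC, T-Rex config style."""
    # 0x02 sets the locally-administered bit so we never collide with real vendor OUIs.
    return [0x02, machine & 0xFF, nic & 0xFF, port & 0xFF, 0x00, 0x00]

def fmt(mac: list[int]) -> str:
    return ":".join(f"{b:02x}" for b in mac)

# Loadtester 1, NIC 2, port 0 -> 02:01:02:00:00:00
print(fmt(testbed_mac(machine=1, nic=2, port=0)))
```
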
Here's where I notice something I've noticed before: the Intel X710 network cards cannot
actually fill 4x10G at line rate. They're fine with larger frames, but they max out at about
32Mpps of throughput -- and we know that each 10G connection filled with small ethernet frames
in one direction will consume 14.88Mpps. The same is true for the XXV710 cards: the chip
used will really only source about 30Mpps across all ports, which is sad but true.

So I have a choice to make: either I run small packets at a rate that's acceptable for the
NIC (~7.5Mpps per port, thus 30Mpps across the X710-DA4), or I run `imix` at line rate
but with somewhat fewer packets/sec. I chose the latter for these tests, and will be reporting
the usage based on the `imix` profile, which saturates 10G at 3.28Mpps in one direction, or
13.12Mpps per network card.

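Where does that 3.28Mpps figure come from? A quick back-of-the-envelope in Python, assuming the 28:16:4 imix mix described later in this post:

```python
# Average L1 (on-the-wire) size of the imix profile: each frame carries an extra
# 7B preamble + 1B SFD + 12B inter-packet gap = 20 bytes.
IMIX = [(28, 64), (16, 590), (4, 1514)]   # (count, frame size in bytes)

frames = sum(n for n, _ in IMIX)
l1_bits = sum(n * (size + 20) * 8 for n, size in IMIX)
avg_l1_bits = l1_bits / frames

pps_10g = 10e9 / avg_l1_bits
print(f"avg L1 frame: {avg_l1_bits / 8:.1f} bytes -> {pps_10g / 1e6:.2f} Mpps per 10G direction")
# -> roughly 3.29 Mpps, matching the ~3.28 Mpps T-Rex reports per port.
```
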
Of course, I can run two of these loadtesters at the same time, *pourquoi pas*, which looks like this:

```
fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
1          0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 00:16:11
1          0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 00:16:10
1          0002.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0102.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:15:51
1          0002.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0104.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:15:50

fsw0#show int usage | exclude 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/2               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/3               10000 Mbit  94.65%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  94.66%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.66%           94.66%           94.66%

fsw0#show cpu core
[Slot 0 : S5860-20SQ]
Core   5Sec     1Min     5Min
 0     16.40%   12.00%   12.80%
```

This is the first time that I noticed that the switch usage (94.66%) somewhat confusingly
lines up with the observed T-Rex statistics: what the switch reports is what T-Rex considers L2
(ethernet) use, not L1 use. For an in-depth explanation of this, see the L3
section below. But for now, let's just say that when T-Rex says it's sending 37.9Gbps of ethernet
traffic (which is 40.00Gbps of bits on the line), that roughly corresponds to the ~94.7% we see
the switch reporting.

So suffice it to say, at 80Gbit of actual throughput (40G of ingress across Te0/1-4 and 40G of
egress across Te0/1-4), the switch performs at line rate, with no noticeable lag or jitter.
The CLI is responsive and the fans aren't spinning any harder than at idle, even after 60 minutes
of packets. Good!

### QinQ

Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) drop
into a Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration:

```
interface TenGigabitEthernet 0/1
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 20
 switchport dot1q-tunnel native vlan 20
!
interface TenGigabitEthernet 0/3
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 21
 switchport dot1q-tunnel native vlan 21
 spanning-tree bpdufilter enable
!
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20-21

fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
20         0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 01:15:01
20         0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 01:15:03
21         0002.0101.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:01:50
21         0002.0102.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:03
21         0002.0103.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:01:59
21         0002.0104.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:02
```

Two things happen that require a bit of explanation. First of all, although both loadtesters
use the exact same configuration (in fact, I didn't even stop them from emitting packets
while reconfiguring the switch), I now have packet loss: the throughput per 10G port has
dropped from 94.67% to 93.63%, and at the same time, I observe that the 40G ports raised
their usage from 94.66% to 94.81%.

```
fsw1#show int usage | e 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.20%           93.63%           94.67%
TenGigabitEthernet 0/2               10000 Mbit  94.21%           93.65%           94.67%
TenGigabitEthernet 0/3               10000 Mbit  91.05%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  90.80%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.81%           94.81%           94.81%
```

The switches, however, are perfectly fine. The reason for this loss is that when I created
the `dot1q-tunnel`, the switch sticks another VLAN tag (4 bytes, or 32 bits) onto each packet
before sending it out the 40G port between the switches, and at these packet rates, that adds
up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps), which, when the
switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps
on top of the 40G line rate, implying we're going to be losing roughly 1.045% of our packets.
And indeed, the difference between 94.67% (inbound) and 93.63% (outbound) is 1.04%, which lines
up.

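The same arithmetic, as a small Python sketch (mine), for anyone who wants to play with the numbers:

```python
# Extra bits added by the outer dot1q tag on the inter-switch trunk.
pps_total = 13.12e6          # 4x 10G ports at ~3.28 Mpps each
tag_bits = 4 * 8             # one extra 802.1Q tag per packet

overhead_bps = pps_total * tag_bits
print(f"tag overhead: {overhead_bps / 1e9:.3f} Gbps")          # ~0.420 Gbps

# That overhead has to fit in the same 40G trunk, so roughly this fraction gets dropped:
loss = overhead_bps / (40e9 + overhead_bps)
print(f"expected loss: {loss * 100:.2f}%")                     # ~1.04%
```
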
```
Global Statistics

connection   : localhost, Port 4501                  total_tx_L2  : 37.92 Gbps
version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
cpu_util.    : 43.52% @ 8 cores (4 per dual port)    total_rx     : 37.92 Gbps
rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.12 Mpps
async_util.  : 0% / 198.64 bps                       drop_rate    : 0 bps
total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

port       |         0         |         1         |         2         |         3
-----------+-------------------+-------------------+-------------------+------------------
owner      |        pim        |        pim        |        pim        |        pim
link       |        UP         |        UP         |        UP         |        UP
state      |   TRANSMITTING    |   TRANSMITTING    |   TRANSMITTING    |   TRANSMITTING
speed      |      10 Gb/s      |      10 Gb/s      |      10 Gb/s      |      10 Gb/s
CPU util.  |      46.29%       |      46.29%       |      40.76%       |      40.76%
--         |                   |                   |                   |
Tx bps L2  |     9.48 Gbps     |     9.48 Gbps     |     9.48 Gbps     |     9.48 Gbps
Tx bps L1  |      10 Gbps      |      10 Gbps      |      10 Gbps      |      10 Gbps
Tx pps     |     3.28 Mpps     |     3.27 Mpps     |     3.27 Mpps     |     3.28 Mpps
Line Util. |     100.04 %      |     100.04 %      |     100.04 %      |     100.04 %
---        |                   |                   |                   |
Rx bps     |     9.48 Gbps     |     9.48 Gbps     |     9.48 Gbps     |     9.48 Gbps
Rx pps     |     3.24 Mpps     |     3.24 Mpps     |     3.23 Mpps     |     3.24 Mpps
----       |                   |                   |                   |
opackets   |    1891576526     |    1891577716     |    1891547042     |    1891548090
ipackets   |    1891576643     |    1891577837     |    1891547158     |    1891548214
obytes     |   684435443496    |   684435873418    |   684424773684    |   684425153614
ibytes     |   684435484082    |   684435916902    |   684424817178    |   684425197948
tx-pkts    |    1.89 Gpkts     |    1.89 Gpkts     |    1.89 Gpkts     |    1.89 Gpkts
rx-pkts    |    1.89 Gpkts     |    1.89 Gpkts     |    1.89 Gpkts     |    1.89 Gpkts
tx-bytes   |     684.44 GB     |     684.44 GB     |     684.42 GB     |     684.43 GB
rx-bytes   |     684.44 GB     |     684.44 GB     |     684.42 GB     |     684.43 GB
-----      |                   |                   |                   |
oerrors    |         0         |         0         |         0         |         0
ierrors    |         0         |         0         |         0         |         0
```

### L3

For this test, I reconfigured the 25G ports to be routed rather than switched, and I put them
under 80% load with T-Rex (where 80% is of L1), so these ports are emitting 20Gbps of traffic
at a rate of 13.12Mpps. I left two of the 10G ports just continuing their ethernet loadtest at
100%, which is also 20Gbps of traffic and 13.12Mpps. In total, I observed 79.95Gbps of traffic
between the two switches: an entirely saturated 40G port in both directions.

I then created a simple topology with OSPF: both switches configured a `Loopback0` interface with
a /32 IPv4 and /128 IPv6 address, and a transit network between them on a `VLAN 100` interface.
OSPF and OSPFv3 both redistribute connected and static routes, to keep things simple.

Finally, I added an IP address on the `Tf0/24` interface, set a static IPv4 route for 16.0.0.0/8
and 48.0.0.0/8 towards that interface on the respective switch, and added VLAN 100 to the `Fo0/26` trunk.
It looks like this for switch `fsw0`:

```
interface Loopback 0
 ip address 100.64.0.0 255.255.255.255
 ipv6 address 2001:DB8::/128
 ipv6 enable

interface VLAN 100
 ip address 100.65.2.1 255.255.255.252
 ipv6 enable
 ip ospf network point-to-point
 ipv6 ospf network point-to-point
 ipv6 ospf 1 area 0

interface TFGigabitEthernet 0/24
 no switchport
 ip address 100.65.1.1 255.255.255.0
 ipv6 address 2001:DB8:1:1::1/64

interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20,21,100

router ospf 1
 graceful-restart
 redistribute connected subnets
 redistribute static subnets
 area 0
 network 100.65.2.0 0.0.0.3 area 0
!

ipv6 router ospf 1
 graceful-restart
 redistribute connected
 redistribute static
 area 0
!

ip route 16.0.0.0 255.0.0.0 100.65.1.2
ipv6 route 2001:db8:100::/40 2001:db8:1:1::2
```

With this topology, an L3 routing domain emerges between `Tf0/24` on switch `fsw0` and `Tf0/24`
on switch `fsw1`, and we can inspect it. Taking a look at `fsw1`, I can see that both the IPv4 and
IPv6 adjacencies have formed, and that the switches, née routers, have learned
routes from one another:

```
fsw1#show ip ospf neighbor
OSPF process 1, 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State      BFD State   Dead Time   Address       Interface
100.65.2.1      1     Full/ -    -           00:00:31    100.65.2.1    VLAN 100

fsw1#show ipv6 ospf neighbor
OSPFv3 Process (1), 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State      BFD State   Dead Time   Instance ID   Interface
100.65.2.1      1     Full/ -    -           00:00:31    0             VLAN 100

fsw1#show ip route
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, A - Arp to host
       * - candidate default

Gateway of last resort is no set
O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100
S    48.0.0.0/8 [1/0] via 100.65.0.2
O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100
C    100.64.0.1/32 is local host.
C    100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24
C    100.65.0.1/32 is local host.
O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100
C    100.65.2.0/30 is directly connected, VLAN 100
C    100.65.2.2/32 is local host.

fsw1#show ipv6 route
IPv6 routing table name - Default - 12 entries
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, N - Nd to host

O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
LC   2001:DB8::1/128 via Loopback 0, local host
C    2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected
L    2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host
O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
C    FE80::/10 via ::1, Null0
C    FE80::/64 via Loopback 0, directly connected
L    FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host
C    FE80::/64 via TFGigabitEthernet 0/24, directly connected
L    FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host
C    FE80::/64 via VLAN 100, directly connected
L    FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host
```

Great success! I can see from the `fsw1` output above that its OSPF process has learned
routes for fsw0's IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::/128 respectively),
its connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and its
static routes (16.0.0.0/8 and 2001:DB8:100::/40).

So let's make use of this topology and switch one of the two loadtesters over
to L3 mode instead:

```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version           : 2
  interfaces        : ["0e:00.0", "0e:00.1" ]
  port_bandwidth_gb : 25
  port_limit        : 2
  port_info         :
    - ip            : 100.65.0.2
      default_gw    : 100.65.0.1
    - ip            : 100.65.1.2
      default_gw    : 100.65.1.1
```

I left the loadtest running for 12 hours or so, and observed the results to be squeaky clean.
The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating
40.00Gbit of traffic at 25.98Mpps (remember, this was with the load set to 80% on the 25G
ports and 99% on the 10G ports). Looking at the switch, and again being surprised by the
discrepancy, I decided to fully explore the curiosity in this switch's utilization reporting.

```
fsw1#show interfaces usage | exclude 0.00
Interface                            Bandwidth   Average Usage  Output Usage   Input Usage
------------------------------------ ----------- -------------- -------------- -----------
TenGigabitEthernet 0/1               10000 Mbit  93.80%         93.79%         93.81%
TenGigabitEthernet 0/2               10000 Mbit  93.80%         93.79%         93.81%
TFGigabitEthernet 0/24               25000 Mbit  75.80%         75.79%         75.81%
FortyGigabitEthernet 0/26            40000 Mbit  94.79%         94.79%         94.79%

fsw1#show int te0/1 | inc packets/sec
  10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec
  10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec

fsw1#show int tf0/24 | inc packets/sec
  10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec
  10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec

fsw1#show int fo0/26 | inc packets/sec
  10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec
  10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec
```

Looking at that number, 75.80% was not the 80% that I had asked for, and the usage of
the 10G ports (which I put at 99% load) and of the 40G port is also lower than I had anticipated.
What's going on there? It's quite simple, after doing some math: ***the switch is reporting
L2 bits/sec, not L1 bits/sec!***

On the L3 loadtest, using the `imix` profile, T-Rex is sending 13.02Mpps of load, which,
according to its own observation, is 37.8Gbps of L2 and 40.00Gbps of L1 bandwidth. On the L2
loadtest, again using the `imix` profile, T-Rex is sending 4x 3.24Mpps, which it claims
is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put this loadtester at 99% of
line rate to ensure I would not end up losing packets due to congestion on the 40G
port).

So according to T-Rex, I am sending 75.4Gbps of traffic (37.6Gbps in the L2 test and 37.8Gbps
in the simultaneous L3 loadtest), yet I'm seeing 37.9Gbps on the switchport. Oh my!

Here's how all of these numbers relate to one another:

* First off, we are sending 99% of line rate at 3.24Mpps into Te0/1 and Te0/2 on each switch.
* Then, we are sending 80% of line rate at 6.55Mpps into Tf0/24 on each switch.
* Te0/1 and Te0/2 are both in the default VLAN on either side.
* But Tf0/24 is sending its IP traffic through the VLAN 100 interconnect, which means all
  of that traffic gets a dot1q VLAN tag added. That's 4 bytes for each packet.
* Sending 6.55Mpps * 32 bits extra equals 209,600,000 bits/sec (0.21Gbps).
* The loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the difference
  we calculated above (0.21Gbps), and equals the overhead created by adding the VLAN tag on
  the 25G stream that sits in VLAN 100.

Now we are ready to explain the difference between the switch-reported port usage and the
loadtester-reported port usage (the sketch after this list double-checks the arithmetic):

* The loadtester is sending an `imix` traffic mix, which consists of a 28:16:4 ratio of
  packets that are 64:590:1514 bytes.
* We already know that to put a packet on the wire, we have to add a 7 byte preamble,
  a 1 byte start frame delimiter, and end with a 12 byte inter-packet gap, so each ethernet
  frame is 20 bytes longer on the wire, making 84 bytes the smallest possible on-the-wire frame.
* We know we're sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage:
  * Each packet needs 20 bytes or 160 bits of overhead, which is 518,400,000 bits/sec.
  * We are seeing 9,381,044,793 bits/sec on a 10G port (**corresponding to the switch's 93.80% usage**).
  * Adding these two numbers up gives us 9,899,444,793 bits/sec (**corresponding to T-Rex's 98.99% usage**).
* Conversely, the whole system is sending 37.9Gbps on the 40G port (**corresponding to the switch's 37.9/40 == 94.79% usage**).
  * We know this is 2x 10G streams at 99% utilization and 1x 25G stream at 80% utilization.
  * This is 13.03Mpps, which generates 2,084,800,000 bits/sec of overhead.
  * Adding these two numbers up gives us 40.00Gbps of usage (which is the expected L1 line rate).

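And the promised sketch, in Python, tying the switch's L2 counters back to T-Rex's L1 view:

```python
# Reconcile switch-reported L2 bits/sec with T-Rex's L1 line rate on fsw1's 40G trunk.
L1_OVERHEAD_BITS = 20 * 8     # preamble + SFD + inter-packet gap per frame

# From 'show int fo0/26': L2 bits/sec and packets/sec as the switch counts them.
l2_bps = 37_915_517_884
pps = 13_032_078

l1_bps = l2_bps + pps * L1_OVERHEAD_BITS
print(f"switch view : {l2_bps / 1e9:.2f} Gbps L2 -> {l2_bps / 40e9:.2%} of 40G")
print(f"wire view   : {l1_bps / 1e9:.2f} Gbps L1 -> {l1_bps / 40e9:.2%} of 40G")
# -> roughly 94.8% as reported by the switch, and ~100% of the physical line rate.
```
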
I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way,
the switches are switching and routing all of this with 0.00% packet loss,
and the chassis doesn't even get warm :-)

```
Global Statistics

connection   : localhost, Port 4501                  total_tx_L2  : 38.02 Gbps
version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
cpu_util.    : 21.88% @ 4 cores (4 per dual port)    total_rx     : 38.02 Gbps
rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.13 Mpps
async_util.  : 0% / 39.16 bps                        drop_rate    : 0 bps
total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

port       |         0         |         1         |       total
-----------+-------------------+-------------------+------------------
owner      |        pim        |        pim        |
link       |        UP         |        UP         |
state      |   TRANSMITTING    |   TRANSMITTING    |
speed      |      25 Gb/s      |      25 Gb/s      |
CPU util.  |      21.88%       |      21.88%       |
--         |                   |                   |
Tx bps L2  |    19.01 Gbps     |    19.01 Gbps     |    38.02 Gbps
Tx bps L1  |    20.06 Gbps     |    20.06 Gbps     |    40.12 Gbps
Tx pps     |     6.57 Mpps     |     6.57 Mpps     |    13.13 Mpps
Line Util. |      80.23 %      |      80.23 %      |
---        |                   |                   |
Rx bps     |      19 Gbps      |      19 Gbps      |    38.01 Gbps
Rx pps     |     6.56 Mpps     |     6.56 Mpps     |    13.13 Mpps
----       |                   |                   |
opackets   |   292215661081    |   292215652102    |   584431313183
ipackets   |   292152912155    |   292153677482    |   584306589637
obytes     |  105733412810506  |  105733412001676  |  211466824812182
ibytes     |  105710857873526  |  105711223651650  |  211422081525176
tx-pkts    |   292.22 Gpkts    |   292.22 Gpkts    |   584.43 Gpkts
rx-pkts    |   292.15 Gpkts    |   292.15 Gpkts    |   584.31 Gpkts
tx-bytes   |     105.73 TB     |     105.73 TB     |     211.47 TB
rx-bytes   |     105.71 TB     |     105.71 TB     |     211.42 TB
-----      |                   |                   |
oerrors    |         0         |         0         |         0
ierrors    |         0         |         0         |         0
```

### Conclusions

It's just super cool to see a switch like this work as expected. I did not manage to
overload it at all, neither with an IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor with
an L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in hardware, as
are IPv4 route lookups. I will be putting these switches into production soon on the
IPng Networks links between Glattbrugg and Rümlang in Zurich, thereby upgrading our
backbone from 10G to 25G CWDM. It seems to me that using these switches as L3 devices,
given a smaller OSPF routing domain (currently, we have ~300 prefixes in our OSPF at
AS50869), would definitely work well, as would pushing and popping QinQ trunks for our
customers (for example on Solnet or Init7 or Openfactory).

Approved. A+, will buy again.