---
date: "2021-08-07T06:17:54Z"
title: 'Review: FS S5860-20SQ Switch'
---

[FiberStore](https://fs.com/) is a staple provider of optics and network gear in Europe. Although
I've been buying optics like SFP+ and QSFP+ from them for years, I rarely looked at the switch
hardware they have on sale, until my buddy Arend suggested one of their switches as a good
alternative for an Internet Exchange Point, one with [Frisian roots](https://frys-ix.net/) no less!

# Executive Summary

{{< image width="400px" float="left" src="/assets/fs-switch/fs-switches.png" alt="Switches" >}}

The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SFP28 ports
and 2x 40G QSFP+ ports, each of which can also be reconfigured into 4x10G. The switch has a
Cisco-like CLI and great performance. I loadtested a pair of them in L2, QinQ, and L3 mode,
and they handled all the packets I sent to and through them, with the 10G, 25G and 40G ports
all in use. Considering the redundant power supplies with relatively low power usage, and the
silicon based switching of L2 and L3, I definitely appreciate the price/performance. The switch
would be an even better match if it allowed for MPLS based L2VPN services, but it doesn't
support that.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/fs-switch/fs-switch-inside.png" alt="Inside" >}}

The switch is based on Broadcom's [BCM56170](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56170), codename *Hurricane*, with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I
noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom
quotes the per-port ingress capacity, while FS.com sums the ingress and egress capacity of each
port. Further, the sales pitch claims 565Mpps, which I found curious: if we divide the available
bandwidth of 380Gbps (the number from the Broadcom datasheet) by the smallest possible frame of
84 bytes (672 bits), we indeed arrive at 565Mpps. Why FS.com decided to seemingly arbitrarily
double the switching capacity while reporting the nominal forwarding rate is beyond me.
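
The frame-rate arithmetic is easy to check; a quick sketch of the claim (my own, nothing from the datasheet):

```python
# Smallest Ethernet frame on the wire: 64 bytes of frame plus 20 bytes of
# overhead (7B preamble, 1B SFD, 12B inter-frame gap) = 84 bytes = 672 bits.
WIRE_BITS_MIN_FRAME = 84 * 8

def max_pps(bandwidth_gbps: float) -> float:
    """Maximum packets/sec at line rate with minimum-sized frames."""
    return bandwidth_gbps * 1e9 / WIRE_BITS_MIN_FRAME

print(f"{max_pps(380) / 1e6:.0f} Mpps")   # Broadcom's 380Gbps -> 565 Mpps
print(f"{max_pps(10) / 1e6:.2f} Mpps")    # one 10G port -> 14.88 Mpps
```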

You can see more (hires) pictures in this [Photo Album](https://photos.app.goo.gl/6M69UgowGf6yWJmU7).

This Broadcom chip is an SoC (System-on-a-Chip) which comes with an Arm A9 core and a modest
amount of TCAM on board, and packs into a 31x31mm *ball grid array* form factor. The switch
chip is able to store 16k routes and ACLs - it did not become immediately obvious to me what
the partitioning is (between IPv4 entries, IPv6 entries, and L2/L3/L4 ACL entries). One can only
assume that the total sum of TCAM based objects must not exceed 4K entries. This means that
as a campus switch, the L3 functionality will be great, including with routing protocols such
as OSPF and IS-IS. However, BGP with any amount of routing table activity will not be a good
fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)

This Broadcom chip alone retails for €798,- apiece at [Digikey](https://www.digikey.ie/products/en?keywords=BCM56170B0IFSBG),
with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing
foundry and supply chain crisis, I don't know. But at that price point, the retail price of
€1150,- per switch is really attractive.

{{< image width="300px" float="right" src="/assets/fs-switch/noise.png" alt="Noise" >}}

The switch comes with two modular and field-replaceable power supplies (rated at 150W each,
delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable
fan trays with one fan each. Idle, without any optics installed and with all interfaces
down, the switch draws about 18W of power, which is nice. The fans spin up only when needed,
and by default the switch is quiet, but certainly not silent. I measured it after a tip from
Michael, certainly nothing scientific, but in a silent room that measures a floor of ~30 dBA,
the switch booted up and briefly burst the fans at 60dBA, after which it **stabilized at 54dBA**
or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed
directly towards the rear of the device at 1 meter distance. Or something, IDK, I'm a network
engineer, Jim, not an audio specialist!

Besides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be
broken out into 4x 10G as well, bringing the TenGig port count to the datasheet-specified 28),
the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy for
firmware upgrades and for copying files such as SSH keys back and forth), an RJ45 1G management
port, which does not participate in the switch fabric at all, and an RJ45 serial port that uses
a standard Cisco cable for access and presents itself as `9600,8n1` to a console server, although
flow control must be disabled on the serial port.

#### Transceiver Compatibility

FS did not attempt any vendor lock-in or crippleware with the ports and optics, yay for that.
I successfully inserted Cisco optics, Arista optics, FS.com 'Generic' optics, and several DACs
for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them. The
switch, as one would expect, supports diagnostics, which look like this:

```
fsw0#show interfaces TFGigabitEthernet0/24 transceiver
Transceiver Type              : 25GBASE-LR-SFP28
Connector Type                : LC
Wavelength(nm)                : 1310
Transfer Distance             :
  SMF fiber
  -- 10km
Digital Diagnostic Monitoring : YES
Vendor Serial Number          : G2006362849

Current diagnostic parameters[AP:Average Power]:
Temp(Celsius)  Voltage(V)  Bias(mA)   RX power(dBm)   TX power(dBm)
33(OK)         3.29(OK)    38.31(OK)  -0.10(OK)[AP]   -0.07(OK)

Transceiver current alarm information:
None
```

... with a helpful shorthand `show interfaces ... trans diag` that only shows the optical budget.

### Software

I bought a pair of switches, and they came delivered with a current firmware version. The devices
identify themselves as `FS Campus Switch (S5860-20SQ) By FS.COM Inc` with a hardware version of `1.00`
and a software version of `S5860_FSOS 12.4(1)B0101P1`. Firmware updates can be downloaded from the
FS.com website directly. I'm not certain if there's a viable ONIE firmware for this chip, although the
N8050 certainly can run ONIE, Cumulus and its own ICOS which is backed by Broadcom. Maybe
in the future I could take a better look at the open networking firmware aspects of this type of
hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to
a certain amount of routes (I think 4K or 16K in the FIB, and only 1GB of RAM on the SoC), this is
not the right platform to pour energy into trying to get DANOS to run on.

Taking a look at the CLI, it's very Cisco IOS-esque; there are a few small differences, but the
look and feel is definitely familiar. The base configuration looks something like this:

```
fsw0#show running-config
hostname fsw0
!
sntp server oob 216.239.35.12
sntp enable
!
username pim privilege 15 secret 5 $1$<redacted>
!
ip name-server oob 8.8.8.8
!
service password-encryption
!
enable service ssh-server
no enable service telnet-server
!
interface Mgmt 0
 ip address 192.168.1.10 255.255.255.0
 gateway 192.168.1.1
!
snmp-server location Zurich, Switzerland
snmp-server contact noc@ipng.ch
snmp-server community 7 <redacted> ro
!
```

Configuration as well follows the familiar `conf t` (configure terminal) that many of us grew up
with, and `show` commands allow for `include` and `exclude` modifiers, of course with all the
shortest-unique abbreviations such as `sh int | i Forty` and the like. VLANs are to be declared
up front, with one notably cool feature of `supervlans`, which are the equivalent of aggregating
VLANs together in the switch - a useful example might be an internet exchange platform which has
trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer,
but then all end up in the same peering LAN VLAN 100.

A few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this
to work, the `oob` keyword has to be used. This is likely because the mgmt port is a network
interface that is attached to the SoC, not to the switch fabric itself, and thus its route is not
added to the routing table. I like this, because it avoids the mgmt network being picked up in
OSPF and accidentally routed to/from. But it does make for slightly awkward config:

```
fsw1#show running-config | inc oob
sntp server oob 216.239.35.12
ip name-server oob 8.8.8.8
ip name-server oob 1.1.1.1
ip name-server oob 9.9.9.9

fsw1#copy ?
  WORD            Copy origin file from native
  flash:          Copy origin file from flash: file system
  ftp:            Copy origin file from ftp: file system
  http:           Copy origin file from http: file system
  oob_ftp:        Copy origin file from oob_ftp: file system
  oob_http:       Copy origin file from oob_http: file system
  oob_tftp:       Copy origin file from oob_tftp: file system
  running-config  Copy origin file from running config
  startup-config  Copy origin file from startup config
  tftp:           Copy origin file from tftp: file system
  tmp:            Copy origin file from tmp: file system
  usb0:           Copy origin file from usb0: file system
```

Note here the hack of `oob_ftp:` and friends; this allows the switch to copy things from the
OOB (management) network by overriding the scheme. But that's OK, I guess: not beautiful,
but it gets the job done, and these types of commands will rarely be used.

A few configuration examples, notably QinQ, in which I configure a port to take usual dot1q
traffic, say from a customer, and add it into our local VLAN 200. Untagged traffic on that
port will turn into our VLAN 200, and tagged traffic will turn into our dot1ad stack of
outer VLAN 200 and inner VLAN whatever the customer provided -- in our case allowing only
VLANs 1000-2000 and untagged traffic into VLAN 200:

```
fsw0#configure
fsw0(config)#vlan 200
fsw0(config-vlan)#name v-qinq-outer
fsw0(config-vlan)#exit
fsw0(config)#interface TenGigabitEthernet 0/3
fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
```

The industry remains conflicted about the outer ethernet frame's type -- originally a tag
protocol identifier (TPID) of 0x9100 was suggested, and that's what this switch uses by
default. But the first standardized specification of Q-in-Q, called 802.1ad, says that the
TPID should be 0x88a8, distinct from the 0x8100 used for regular 802.1q VLAN tags. This ugly
reality can be reflected directly in the switchport configuration by adding a `frame-tag tpid
0xXXXX` value to let the switch know which TPID to use for the outer tag.
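
To make the TPID difference concrete, here is a sketch (pure Python, my own illustration; the helper name is made up) of the eight tag bytes that follow the source MAC in the same double-tagged frame under each convention:

```python
import struct

def qinq_tags(outer_vid: int, inner_vid: int, outer_tpid: int = 0x9100) -> bytes:
    """Build the 8 bytes of outer+inner VLAN tags that follow the source MAC.
    Each tag is a 2-byte TPID followed by a 2-byte TCI (PCP/DEI zero here,
    12-bit VLAN ID in the low bits)."""
    return struct.pack("!HHHH", outer_tpid, outer_vid & 0xFFF,
                       0x8100, inner_vid & 0xFFF)

# This switch's default outer TPID (legacy 0x9100):
print(qinq_tags(200, 1000).hex())            # 910000c8810003e8
# The 802.1ad standard outer TPID (0x88a8):
print(qinq_tags(200, 1000, 0x88a8).hex())    # 88a800c8810003e8
```

Only the first two bytes differ; the rest of the frame is identical, which is why the mismatch is so easy to paper over with a per-port TPID knob.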

If this type of historical thing interests you, I definitely recommend reading up on
[802.1q](https://en.wikipedia.org/wiki/IEEE_802.1Q) and [802.1ad](https://en.wikipedia.org/wiki/IEEE_802.1ad)
on Wikipedia as well.

## Loadtests

For my loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
with a custom Python controller that ramps traffic from the loadtester to the device under test
(DUT) up and down, by sending traffic out `port0` to the DUT and expecting that traffic to be
presented back from the DUT on its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). You can read a bit more about my setup in my [Loadtesting at Coloclue]({% post_url 2021-02-27-coloclue-loadtest %})
post.

To stress test the switch, several port pairs at 10G and 25G were used, and since the specs
boast line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found
out, once again, that Intel's X710 network cards aren't line rate, something I'll dive into in
a bit more detail another day; for now, take a look at the [T-Rex docs](https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_stateless_bench.asciidoc).

### L2

First let's test a straightforward configuration. I connect a DAC between a 40G port on each
switch, and connect a loadtester to port `TenGigabitEthernet 0/1` and `TenGigabitEthernet 0/2`
on either switch, and leave everything simply in the default VLAN. This means packets from
Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and
out on Te0/1 and Te0/2 there, to return to the loadtester. Configuration wise, rather boring:

```
fsw0#configure
fsw0(config)#vlan 1
fsw0(config-vlan)#name v-default

fsw0#show run int te0/1
interface TenGigabitEthernet 0/1

fsw0#show run int te0/2
interface TenGigabitEthernet 0/2

fsw0#show run int fo0/26
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1

fsw0#show vlan id 1
VLAN  Name                              Status    Ports
----- --------------------------------- --------- -----------------------------------
1     v-default                         STATIC    Te0/1, Te0/2, Te0/3, Te0/4
                                                  Te0/5, Te0/6, Te0/7, Te0/8
                                                  Te0/9, Te0/10, Te0/11, Te0/12
                                                  Te0/13, Te0/14, Te0/15, Te0/16
                                                  Te0/17, Te0/18, Te0/19, Te0/20
                                                  TF0/21, TF0/22, TF0/23, Fo0/25
                                                  Fo0/26
```

I set up T-Rex with unique MAC addresses for each of its ports. I find it useful
to codify a few bits of information into the MAC, such as loadtester machine,
PCI bus, and port, so that when I try to find them in the switches' forwarding
tables while many loadtesters are running at the same time, it's easier to
find what I'm looking for. My T-Rex configuration for this loadtest:
```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version      : 2
  interfaces   : ["42:00.0","42:00.1", "42:00.2", "42:00.3"]
  port_limit   : 4
  port_info    :
    - dest_mac : [0x0,0x2,0x1,0x1,0x0,0x00] # port 0
      src_mac  : [0x0,0x2,0x1,0x2,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x2,0x0,0x00] # port 1
      src_mac  : [0x0,0x2,0x1,0x1,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x3,0x0,0x00] # port 2
      src_mac  : [0x0,0x2,0x1,0x4,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x4,0x0,0x00] # port 3
      src_mac  : [0x0,0x2,0x1,0x3,0x0,0x00]
```

Here's where I notice something I've seen before: the Intel X710 network cards cannot
actually fill 4x10G at line rate. They're fine at larger frames, but they max out at about
32Mpps of throughput -- and we know that each 10G connection filled with small ethernet frames
in one direction will consume 14.88Mpps. The same is true for the XXV710 cards: the chip
used will really only source around 30Mpps across all ports, which is sad but true.

So I have a choice to make: either I run small packets at a rate that's acceptable for the
NIC (~7.5Mpps per port, thus 30Mpps across the X710-DA4), or I run `imix` at line rate
but with slightly fewer packets/sec. I chose the latter for these tests, and will be reporting
the usage based on the `imix` profile, which saturates 10G at 3.28Mpps in one direction, or
13.12Mpps per network card.
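
That ~3.28Mpps figure falls out of the average imix frame size; a quick sketch, assuming the classic "simple imix" mix of 7x64B, 4x594B and 1x1518B frames (I believe this is what T-Rex's profile uses, but treat the exact mix as an assumption):

```python
# The classic "simple imix": 7x 64B, 4x 594B, 1x 1518B frames. The 20B per
# frame covers preamble, SFD and inter-frame gap on the wire.
MIX = [(7, 64), (4, 594), (1, 1518)]
OVERHEAD = 20

frames = sum(n for n, _ in MIX)
avg_wire_bits = sum(n * (size + OVERHEAD) * 8 for n, size in MIX) / frames

pps_10g = 10e9 / avg_wire_bits
print(f"{pps_10g / 1e6:.2f} Mpps saturates one direction of a 10G port")  # ~3.27 Mpps
```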

Of course, I can run two of these at the same time, *pourquoi pas*, which looks like this:

```
fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
1          0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 00:16:11
1          0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 00:16:10
1          0002.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0102.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:15:51
1          0002.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0104.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:15:50

fsw0#show int usage | exclude 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/2               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/3               10000 Mbit  94.65%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  94.66%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.66%           94.66%           94.66%

fsw0#show cpu core
[Slot 0 : S5860-20SQ]
   Core   5Sec   1Min   5Min
      0 16.40% 12.00% 12.80%
```

This is the first time I noticed that the switch usage (94.66%) somewhat confusingly
lines up with the observed T-Rex statistics: what the switch reports is what T-Rex considers
L2 (ethernet) use, not L1 use. For an in-depth explanation of this, see the L3 section
below. But for now, let's just say that when T-Rex says it's sending 37.9Gbps of ethernet
traffic (which is 40.00Gbps of bits on the wire), that corresponds to the ~94.7% we see
the switch reporting.

Suffice to say, at 80Gbit of actual throughput (40G ingressing on Te0/1-4 and 40G egressing
on Te0/1-4), the switch performs at line rate, with no noticeable lag or jitter.
The CLI stays responsive and the fans aren't spinning any harder than at idle, even after
60min of packets. Good!

### QinQ

Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) drop
into its own Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration:

```
interface TenGigabitEthernet 0/1
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 20
 switchport dot1q-tunnel native vlan 20
!
interface TenGigabitEthernet 0/3
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 21
 switchport dot1q-tunnel native vlan 21
 spanning-tree bpdufilter enable
!
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20-21

fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
20         0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 01:15:01
20         0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 01:15:03
21         0002.0101.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:01:50
21         0002.0102.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:03
21         0002.0103.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:01:59
21         0002.0104.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:02
```

Two things happen here that require a bit of explanation. First of all, despite both
loadtesters using the exact same configuration (in fact, I didn't even stop them from
emitting packets while reconfiguring the switch), I now have packet loss: the throughput
per 10G port has dropped from 94.67% to 93.63%, and at the same time, I observe that the
40G ports raised their usage from 94.66% to 94.81%.

```
fsw1#show int usage | e 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.20%           93.63%           94.67%
TenGigabitEthernet 0/2               10000 Mbit  94.21%           93.65%           94.67%
TenGigabitEthernet 0/3               10000 Mbit  91.05%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  90.80%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.81%           94.81%           94.81%
```

The switches, however, are perfectly fine. The reason for the loss is that when I created
the `dot1q-tunnel`, the switch sticks another VLAN tag (4 bytes, or 32 bits) on each packet
before sending it out the 40G port between the switches, and at these packet rates, that adds
up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps), which, when the
switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps
on top of the 40G line rate, implying we're going to lose roughly 1.05% of our packets.
And indeed, the difference between 94.67% (inbound) and 93.63% (outbound) is 1.04%, which
lines up nicely.
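
The back-of-the-envelope above can be checked in a few lines (my own sketch):

```python
# Each dot1q-tunnel packet grows by one 4-byte (32-bit) VLAN tag on the trunk.
PPS_PER_PORT = 3.28e6   # imix at ~100% of 10G, one direction
PORTS = 4               # Te0/1-4 feeding the Fo0/26 trunk
TAG_BITS = 32
TRUNK_BPS = 40e9

total_pps = PPS_PER_PORT * PORTS            # 13.12 Mpps
overhead_bps = total_pps * TAG_BITS         # extra bits added by the outer tag
loss_pct = overhead_bps / TRUNK_BPS * 100   # share that no longer fits in 40G

print(f"overhead: {overhead_bps / 1e6:.1f} Mbps, expected loss: {loss_pct:.2f}%")
# -> overhead: 419.8 Mbps, expected loss: 1.05%
```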

```
Global Statistics

connection   : localhost, Port 4501                  total_tx_L2  : 37.92 Gbps
version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
cpu_util.    : 43.52% @ 8 cores (4 per dual port)    total_rx     : 37.92 Gbps
rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.12 Mpps
async_util.  : 0% / 198.64 bps                       drop_rate    : 0 bps
total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

   port    |         0         |         1         |         2         |         3
-----------+-------------------+-------------------+-------------------+------------------
owner      |               pim |               pim |               pim |               pim
link       |                UP |                UP |                UP |                UP
state      |      TRANSMITTING |      TRANSMITTING |      TRANSMITTING |      TRANSMITTING
speed      |           10 Gb/s |           10 Gb/s |           10 Gb/s |           10 Gb/s
CPU util.  |            46.29% |            46.29% |            40.76% |            40.76%
--         |                   |                   |                   |
Tx bps L2  |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps
Tx bps L1  |           10 Gbps |           10 Gbps |           10 Gbps |           10 Gbps
Tx pps     |         3.28 Mpps |         3.27 Mpps |         3.27 Mpps |         3.28 Mpps
Line Util. |          100.04 % |          100.04 % |          100.04 % |          100.04 %
---        |                   |                   |                   |
Rx bps     |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps
Rx pps     |         3.24 Mpps |         3.24 Mpps |         3.23 Mpps |         3.24 Mpps
----       |                   |                   |                   |
opackets   |        1891576526 |        1891577716 |        1891547042 |        1891548090
ipackets   |        1891576643 |        1891577837 |        1891547158 |        1891548214
obytes     |      684435443496 |      684435873418 |      684424773684 |      684425153614
ibytes     |      684435484082 |      684435916902 |      684424817178 |      684425197948
tx-pkts    |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts
rx-pkts    |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts
tx-bytes   |         684.44 GB |         684.44 GB |         684.42 GB |         684.43 GB
rx-bytes   |         684.44 GB |         684.44 GB |         684.42 GB |         684.43 GB
-----      |                   |                   |                   |
oerrors    |                 0 |                 0 |                 0 |                 0
ierrors    |                 0 |                 0 |                 0 |                 0
```

### L3

For this test, I reconfigured the 25G ports to be routed rather than switched, and I put them
under 80% load with T-Rex (where 80% is of L1), so those ports are emitting 20Gbps of traffic
at a rate of 13.12Mpps. I left two of the 10G ports continuing their ethernet loadtest at
100%, which is also 20Gbps of traffic and 13.12Mpps. In total, I observed 79.95Gbps of traffic
between the two switches: an entirely saturated 40G port in both directions.
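
A quick sanity check on those packet rates, under the same simple-imix assumption as earlier (my own sketch, with the 10G ports taken at full load):

```python
# Average wire bits per frame for the classic "simple imix" (assumed to be
# the mix behind T-Rex's imix profile), incl. 20B preamble/IFG per frame.
AVG_WIRE_BITS = (7 * (64 + 20) + 4 * (594 + 20) + 1 * (1518 + 20)) * 8 / 12

def pps(line_gbps: float, load: float) -> float:
    """Packets/sec when driving `load` fraction of an Ethernet line rate."""
    return line_gbps * 1e9 * load / AVG_WIRE_BITS

print(f"25G @ 80%:  {pps(25, 0.80) / 1e6:.2f} Mpps")        # ~6.55 Mpps
print(f"10G @ 100%: {pps(10, 1.00) / 1e6:.2f} Mpps")        # ~3.27 Mpps
trunk = pps(25, 0.80) + 2 * pps(10, 1.00)                   # Tf0/24 + Te0/1 + Te0/2
print(f"40G trunk:  {trunk / 1e6:.1f} Mpps per direction")  # ~13.1 Mpps
```

The ~6.55Mpps and ~13.1Mpps figures are exactly what the `show int` counters further down report for `Tf0/24` and `Fo0/26`.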

I then created a simple topology with OSPF: both switches got a `Loopback0` interface with
a /32 IPv4 and /128 IPv6 address, and a transit network between them in a `VLAN100` interface.
OSPF and OSPFv3 both redistribute connected and static routes, to keep things simple.

Finally, I added an IP address on the `Tf0/24` interface, set a static IPv4 route for 16.0.0.0/8
and 48.0.0.0/8 towards that interface on each switch respectively, and added VLAN 100 to the
`Fo0/26` trunk. It looks like this for switch `fsw0`:

```
interface Loopback 0
 ip address 100.64.0.0 255.255.255.255
 ipv6 address 2001:DB8::/128
 ipv6 enable

interface VLAN 100
 ip address 100.65.2.1 255.255.255.252
 ipv6 enable
 ip ospf network point-to-point
 ipv6 ospf network point-to-point
 ipv6 ospf 1 area 0

interface TFGigabitEthernet 0/24
 no switchport
 ip address 100.65.1.1 255.255.255.0
 ipv6 address 2001:DB8:1:1::1/64

interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20,21,100

router ospf 1
 graceful-restart
 redistribute connected subnets
 redistribute static subnets
 area 0
 network 100.65.2.0 0.0.0.3 area 0
!

ipv6 router ospf 1
 graceful-restart
 redistribute connected
 redistribute static
 area 0
!

ip route 16.0.0.0 255.0.0.0 100.65.1.2
ipv6 route 2001:db8:100::/40 2001:db8:1:1::2
```

With this topology, an L3 routing domain emerges between `Tf0/24` on switch `fsw0` and `Tf0/24`
on switch `fsw1`, and we can inspect this. Taking a look at `fsw1`, I can see that both IPv4 and
IPv6 adjacencies have formed, and that the switches, née routers, have learned
routes from one another:

```
fsw1#show ip ospf neighbor
 OSPF process 1, 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State        BFD State   Dead Time   Address         Interface
100.65.2.1      1     Full/ -      -           00:00:31    100.65.2.1      VLAN 100

fsw1#show ipv6 ospf neighbor
 OSPFv3 Process (1), 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State        BFD State   Dead Time   Instance ID   Interface
100.65.2.1      1     Full/ -      -           00:00:31    0             VLAN 100

fsw1#show ip route
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, A - Arp to host
       * - candidate default

Gateway of last resort is no set
O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100
S    48.0.0.0/8 [1/0] via 100.65.0.2
O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100
C    100.64.0.1/32 is local host.
C    100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24
C    100.65.0.1/32 is local host.
O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100
C    100.65.2.0/30 is directly connected, VLAN 100
C    100.65.2.2/32 is local host.

fsw1#show ipv6 route
IPv6 routing table name - Default - 12 entries
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, N - Nd to host

O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
LC   2001:DB8::1/128 via Loopback 0, local host
C    2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected
L    2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host
O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
C    FE80::/10 via ::1, Null0
C    FE80::/64 via Loopback 0, directly connected
L    FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host
C    FE80::/64 via TFGigabitEthernet 0/24, directly connected
L    FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host
C    FE80::/64 via VLAN 100, directly connected
L    FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host
```

Great success! I can see from the `fsw1` output above that its OSPF process has learned
routes for the IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::/128 respectively),
the connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and the
static routes (16.0.0.0/8 and 2001:DB8:100::/40).

So let's make use of this topology and change one of the two loadtesters to L3 mode
instead:

```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version           : 2
  interfaces        : ["0e:00.0", "0e:00.1" ]
  port_bandwidth_gb : 25
  port_limit        : 2
  port_info         :
    - ip            : 100.65.0.2
      default_gw    : 100.65.0.1
    - ip            : 100.65.1.2
      default_gw    : 100.65.1.1
```

I left the loadtest running for 12hrs or so, and observed the results to be squeaky clean.
The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating
40.00Gbit of traffic at 25.98Mpps (remember, this was with the load set to 80% on the 25G
port, and 99% on the 10G ports). Looking at the switch and again being surprised by the
discrepancy, I decided to fully explore the curiosity in this switch's utilization reporting.

```
fsw1#show interfaces usage | exclude 0.00
Interface                            Bandwidth   Average Usage  Output Usage   Input Usage
------------------------------------ ----------- -------------- -------------- -----------
TenGigabitEthernet 0/1               10000 Mbit  93.80%         93.79%         93.81%
TenGigabitEthernet 0/2               10000 Mbit  93.80%         93.79%         93.81%
TFGigabitEthernet 0/24               25000 Mbit  75.80%         75.79%         75.81%
FortyGigabitEthernet 0/26            40000 Mbit  94.79%         94.79%         94.79%

fsw1#show int te0/1 | inc packets/sec
  10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec
  10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec

fsw1#show int tf0/24 | inc packets/sec
  10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec
  10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec

fsw1#show int fo0/26 | inc packets/sec
  10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec
  10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec
```
|
||||
|
||||
Looking at that number, 75.80% was not the 80% that I had asked for, and the usage of
the 10G ports (which I put at 99% load) and the 40G port is also lower than I had
anticipated. What's going on there? It's quite simple after doing some math: ***the
switch is reporting L2 bits/sec, not L1 bits/sec!***

On the L3 loadtest, using the `imix` profile, T-Rex is sending 13.02Mpps of load, which,
according to its own observation, is 37.8Gbit of L2 and 40.00Gbps of L1 bandwidth. On the
L2 loadtest, again using the `imix` profile, T-Rex is sending 4x 3.24Mpps, which it claims
is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put the loadtester here at 99% of
line rate, to ensure I would not end up losing packets due to congestion on the 40G port).

So according to T-Rex, I am sending 75.4Gbps of traffic (37.8Gbps in the L3 test and
37.6Gbps in the simultaneous L2 loadtest), yet I'm seeing 37.9Gbps on the switchport.
Oh my!

Here's how all of these numbers relate to one another:

* First off, we are sending 99% linerate at 3.24Mpps into Te0/1 and Te0/2 on each switch.
* Then, we are sending 80% linerate at 6.55Mpps into Tf0/24 on each switch.
* Te0/1 and Te0/2 are both in the default VLAN on either side.
* But Tf0/24 sends its IP traffic through the VLAN 100 interconnect, which means all of
  that traffic gets a dot1q VLAN tag added. That's 4 bytes for each packet.
* Sending 6.55Mpps * 32 bits extra equals 209600000 bits/sec (0.21Gbps).
* The loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the
  difference we calculated above (0.21Gbps): the overhead created by adding the VLAN tag
  to the 25G stream in VLAN 100.

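That dot1q bookkeeping is easy to double-check. Here's a quick back-of-the-envelope
sketch in plain Python, using the packet and bit rates reported above:

```python
# Back-of-the-envelope check: the 25G stream crosses the VLAN 100 trunk,
# so every packet grows by a 4-byte (32-bit) dot1q tag.
pps = 6.55e6        # packets/sec on Tf0/24, as reported by the switch
tag_bits = 4 * 8    # one 802.1Q tag per packet

overhead_bps = pps * tag_bits
print(f"dot1q overhead: {overhead_bps / 1e9:.2f} Gbps")   # 0.21 Gbps

loadtester_gbps = 37.70   # what T-Rex reports (untagged frames)
switch_gbps = loadtester_gbps + overhead_bps / 1e9
print(f"switch should see: {switch_gbps:.2f} Gbps")       # 37.91 Gbps
```

Which lands exactly on the 37.91Gbps the switch counters show.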
Now we are ready to explain the difference between the switch-reported port usage and
the loadtester-reported port usage:

* The loadtester is sending an `imix` traffic mix, which consists of a 28:16:4 ratio of
  packets that are 64, 590 and 1514 bytes long.
* We already know that to create a packet on the wire, we have to add a 7 byte preamble
  and a 1 byte start frame delimiter, and end with a 12 byte interpacket gap, so each
  ethernet frame is 20 bytes longer, making 84 bytes the smallest possible on-the-wire
  frame.
* We know we're sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage:
  * Each packet needs 20 bytes or 160 bits of overhead, which is 518400000 bits/sec.
  * We are seeing 9381044793 bits/sec on the 10G port (**corresponding to the switch's 93.80% usage**).
  * Adding these two numbers up gives us 9899444793 bits/sec (**corresponding to T-Rex's 98.99% usage**).
* Conversely, the whole system is sending 37.9Gbps on the 40G port (**corresponding to the switch's 37.9/40 == 94.79% usage**):
  * We know this is 2x 10G streams at 99% utilization and 1x 25G stream at 80% utilization.
  * This is 13.03Mpps, which generates 2084800000 bits/sec of overhead.
  * Adding these two numbers up gives us 40.00Gbps of usage (the expected L1 line rate).

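The same reconciliation can be sketched in a few lines of Python. The `imix` packet
sizes and the bit/packet counters are taken from the output above; the rest is just
the 20-byte-per-frame Ethernet overhead arithmetic:

```python
# Reconcile switch-reported (L2) and T-Rex-reported (L1) usage on a 10G port.
# imix: 28:16:4 ratio of 64, 590 and 1514 byte frames.
ratio = {64: 28, 590: 16, 1514: 4}
pkts = sum(ratio.values())
avg_l2 = sum(size * n for size, n in ratio.items()) / pkts   # ~360 bytes on average
avg_l1 = avg_l2 + 20   # + preamble(7) + SFD(1) + interpacket gap(12)

# At 99% of 10G line rate, how many packets/sec should we expect?
pps = 0.99 * 10e9 / (avg_l1 * 8)
print(f"expected pps: {pps / 1e6:.2f} Mpps")   # ~3.26 Mpps, close to the 3.24 Mpps observed

# Going the other way: L2 counter + per-packet overhead == L1 rate.
l2_bps = 9381044793                  # switch counter (93.80% of 10G)
l1_bps = l2_bps + 3.24e6 * 160       # 20 bytes == 160 bits per packet
print(f"L1 rate: {l1_bps / 1e9:.2f} Gbps")     # 9.90 Gbps, i.e. T-Rex's 98.99%
```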
I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way,
the switches are switching and routing all of this with 0.00% packet loss, and the
chassis doesn't even get warm :-)

```
Global Statistics

 connection   : localhost, Port 4501                  total_tx_L2  : 38.02 Gbps
 version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
 cpu_util.    : 21.88% @ 4 cores (4 per dual port)    total_rx     : 38.02 Gbps
 rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.13 Mpps
 async_util.  : 0% / 39.16 bps                        drop_rate    : 0 bps
 total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

   port    |         0         |         1         |       total
-----------+-------------------+-------------------+------------------
owner      |        pim        |        pim        |
link       |        UP         |        UP         |
state      |   TRANSMITTING    |   TRANSMITTING    |
speed      |      25 Gb/s      |      25 Gb/s      |
CPU util.  |      21.88%       |      21.88%       |
--         |                   |                   |
Tx bps L2  |    19.01 Gbps     |    19.01 Gbps     |    38.02 Gbps
Tx bps L1  |    20.06 Gbps     |    20.06 Gbps     |    40.12 Gbps
Tx pps     |     6.57 Mpps     |     6.57 Mpps     |    13.13 Mpps
Line Util. |      80.23 %      |      80.23 %      |
---        |                   |                   |
Rx bps     |      19 Gbps      |      19 Gbps      |    38.01 Gbps
Rx pps     |     6.56 Mpps     |     6.56 Mpps     |    13.13 Mpps
----       |                   |                   |
opackets   |   292215661081    |   292215652102    |   584431313183
ipackets   |   292152912155    |   292153677482    |   584306589637
obytes     |  105733412810506  |  105733412001676  |  211466824812182
ibytes     |  105710857873526  |  105711223651650  |  211422081525176
tx-pkts    |   292.22 Gpkts    |   292.22 Gpkts    |   584.43 Gpkts
rx-pkts    |   292.15 Gpkts    |   292.15 Gpkts    |   584.31 Gpkts
tx-bytes   |     105.73 TB     |     105.73 TB     |     211.47 TB
rx-bytes   |     105.71 TB     |     105.71 TB     |     211.42 TB
-----      |                   |                   |
oerrors    |         0         |         0         |         0
ierrors    |         0         |         0         |         0
```

### Conclusions

It's just super cool to see a switch like this work as expected. I did not manage to
overload it at all, neither with an IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor
with an L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in
hardware, as are IPv4 route lookups. I will be putting these switches into production
soon on the IPng Networks links between Glattbrugg and Rümlang in Zurich, thereby
upgrading our backbone from 10G to 25G CWDM. It seems to me that using these switches as
L3 devices in a smaller OSPF routing domain (currently, we have ~300 prefixes in our
OSPF at AS50869) would definitely work well, as would pushing and popping QinQ trunks
for our customers (for example on Solnet or Init7 or Openfactory).

Approved. A+, will buy again.