---
date: "2021-08-07T06:17:54Z"
title: 'Review: FS S5860-20SQ Switch'
---

[FiberStore](https://fs.com/) is a staple provider of optics and network gear in Europe. Although
I've been buying optics like SFP+ and QSFP+ from them for years, I rarely looked at the switch
hardware they have on sale, until my buddy Arend suggested one of their switches as a good
alternative for an Internet Exchange Point, one with [Frisian roots](https://frys-ix.net/) no less!

# Executive Summary

{{< image width="400px" float="left" src="/assets/fs-switch/fs-switches.png" alt="Switches" >}}

The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SFP28 ports
and 2x 40G QSFP+ ports, each of which can also be reconfigured into 4x10G. The switch has a
Cisco-like CLI and great performance. I loadtested a pair of them in L2, QinQ, and L3 mode,
and they handled all the packets I sent to and through them, with the 10G, 25G and 40G ports
all in use. Considering the redundant power supplies with relatively low power usage, and the
silicon based switching of L2 and L3, I definitely appreciate the price/performance. The switch
would be an even better match if it allowed for MPLS based L2VPN services, but it doesn't
support that.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/fs-switch/fs-switch-inside.png" alt="Inside" >}}

The switch is based on Broadcom's [BCM56170](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56170), codename *Hurricane*, with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I
noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom
quotes the per-port ingress capacity, while FS.com sums the ingress and egress capacity of each
port. Further, the sales pitch claims 565Mpps, which I found curious: if we divide the available
bandwidth of 380Gbps (the number from the Broadcom datasheet) by the smallest possible frame of
84 bytes (672 bits), we indeed arrive at 565Mpps. Why FS.com decided to seemingly arbitrarily
double the switching capacity while reporting the nominal forwarding rate is beyond me.
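
The frame-rate arithmetic is easy to check; a quick sketch of the claim (my own, nothing from the datasheet):

```python
# Smallest Ethernet frame on the wire: 64 bytes of frame plus 20 bytes of
# overhead (7B preamble, 1B SFD, 12B inter-frame gap) = 84 bytes = 672 bits.
WIRE_BITS_MIN_FRAME = 84 * 8

def max_pps(bandwidth_gbps: float) -> float:
    """Maximum packets/sec at line rate with minimum-sized frames."""
    return bandwidth_gbps * 1e9 / WIRE_BITS_MIN_FRAME

print(f"{max_pps(380) / 1e6:.0f} Mpps")   # Broadcom's 380Gbps -> 565 Mpps
print(f"{max_pps(10) / 1e6:.2f} Mpps")    # one 10G port -> 14.88 Mpps
```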

You can see more (hires) pictures in this [Photo Album](https://photos.app.goo.gl/6M69UgowGf6yWJmU7).

This Broadcom chip is an SoC (System-on-a-Chip) which comes with an Arm A9 core and a modest
amount of TCAM on board, and packs into a 31x31mm *ball grid array* form factor. The switch
chip is able to store 16k routes and ACLs - it did not become immediately obvious to me what
the partitioning is (between IPv4 entries, IPv6 entries, and L2/L3/L4 ACL entries). One can only
assume that the total sum of TCAM based objects must not exceed 4K entries. This means that
as a campus switch, the L3 functionality will be great, including with routing protocols such
as OSPF and IS-IS. However, BGP with any amount of routing table activity will not be a good
fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)

This Broadcom chip alone retails for €798,- apiece at [Digikey](https://www.digikey.ie/products/en?keywords=BCM56170B0IFSBG),
with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing
foundry and supply chain crisis, I don't know. But at that price point, the retail price of
€1150,- per switch is really attractive.

{{< image width="300px" float="right" src="/assets/fs-switch/noise.png" alt="Noise" >}}

The switch comes with two modular and field-replaceable power supplies (rated at 150W each,
delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable
fan trays with one fan each. Idle, without any optics installed and with all interfaces
down, the switch draws about 18W of power, which is nice. The fans spin up only when needed,
and by default the switch is quiet, but certainly not silent. I measured it after a tip from
Michael, certainly nothing scientific, but in a silent room that measures a floor of ~30 dBA,
the switch booted up and briefly burst the fans at 60dBA, after which it **stabilized at 54dBA**
or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed
directly towards the rear of the device at 1 meter distance. Or something, IDK, I'm a network
engineer, Jim, not an audio specialist!

Besides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be
broken out into 4x 10G as well, bringing the TenGig port count to the datasheet-specified 28),
the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy for
firmware upgrades and for copying files such as SSH keys back and forth), an RJ45 1G management
port, which does not participate in the switch fabric at all, and an RJ45 serial port that uses
a standard Cisco cable for access and presents itself as `9600,8n1` to a console server, although
flow control must be disabled on the serial port.

#### Transceiver Compatibility

FS did not attempt any vendor lock-in or crippleware with the ports and optics, yay for that.
I successfully inserted Cisco optics, Arista optics, FS.com 'Generic' optics, and several DACs
for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them. The
switch, as one would expect, supports diagnostics, which look like this:

```
fsw0#show interfaces TFGigabitEthernet0/24 transceiver
Transceiver Type              : 25GBASE-LR-SFP28
Connector Type                : LC
Wavelength(nm)                : 1310
Transfer Distance             :
  SMF fiber
  -- 10km
Digital Diagnostic Monitoring : YES
Vendor Serial Number          : G2006362849

Current diagnostic parameters[AP:Average Power]:
Temp(Celsius)  Voltage(V)  Bias(mA)   RX power(dBm)   TX power(dBm)
33(OK)         3.29(OK)    38.31(OK)  -0.10(OK)[AP]   -0.07(OK)

Transceiver current alarm information:
None
```

... with a helpful shorthand `show interfaces ... trans diag` that only shows the optical budget.

### Software

I bought a pair of switches, and they came delivered with a current firmware version. The devices
identify themselves as `FS Campus Switch (S5860-20SQ) By FS.COM Inc` with a hardware version of `1.00`
and a software version of `S5860_FSOS 12.4(1)B0101P1`. Firmware updates can be downloaded from the
FS.com website directly. I'm not certain if there's a viable ONIE firmware for this chip, although the
N8050 certainly can run ONIE, Cumulus and its own ICOS which is backed by Broadcom. Maybe
in the future I could take a better look at the open networking firmware aspects of this type of
hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to
a certain amount of routes (I think 4K or 16K in the FIB, and only 1GB of RAM on the SoC), this is
not the right platform to pour energy into trying to get DANOS to run on.

Taking a look at the CLI, it's very Cisco IOS-esque; there are a few small differences, but the
look and feel is definitely familiar. The base configuration looks something like this:

```
fsw0#show running-config
hostname fsw0
!
sntp server oob 216.239.35.12
sntp enable
!
username pim privilege 15 secret 5 $1$<redacted>
!
ip name-server oob 8.8.8.8
!
service password-encryption
!
enable service ssh-server
no enable service telnet-server
!
interface Mgmt 0
 ip address 192.168.1.10 255.255.255.0
 gateway 192.168.1.1
!
snmp-server location Zurich, Switzerland
snmp-server contact noc@ipng.ch
snmp-server community 7 <redacted> ro
!
```

Configuration as well follows the familiar `conf t` (configure terminal) that many of us grew up
with, and `show` commands allow for `include` and `exclude` modifiers, of course with all the
shortest-unique abbreviations such as `sh int | i Forty` and the like. VLANs are to be declared
up front, with one notably cool feature of `supervlans`, which are the equivalent of aggregating
VLANs together in the switch - a useful example might be an internet exchange platform which has
trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer,
but then all end up in the same peering LAN VLAN 100.

A few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this
to work, the `oob` keyword has to be used. This is likely because the mgmt port is a network
interface that is attached to the SoC, not to the switch fabric itself, and thus its route is not
added to the routing table. I like this, because it avoids the mgmt network being picked up in
OSPF and accidentally routed to/from. But it does make for slightly awkward config:

```
fsw1#show running-config | inc oob
sntp server oob 216.239.35.12
ip name-server oob 8.8.8.8
ip name-server oob 1.1.1.1
ip name-server oob 9.9.9.9

fsw1#copy ?
  WORD            Copy origin file from native
  flash:          Copy origin file from flash: file system
  ftp:            Copy origin file from ftp: file system
  http:           Copy origin file from http: file system
  oob_ftp:        Copy origin file from oob_ftp: file system
  oob_http:       Copy origin file from oob_http: file system
  oob_tftp:       Copy origin file from oob_tftp: file system
  running-config  Copy origin file from running config
  startup-config  Copy origin file from startup config
  tftp:           Copy origin file from tftp: file system
  tmp:            Copy origin file from tmp: file system
  usb0:           Copy origin file from usb0: file system
```

Note here the hack of `oob_ftp:` and friends; this allows the switch to copy things from the
OOB (management) network by overriding the scheme. But that's OK, I guess: not beautiful,
but it gets the job done, and these types of commands will rarely be used.

A few configuration examples, notably QinQ, in which I configure a port to take usual dot1q
traffic, say from a customer, and add it into our local VLAN 200. Untagged traffic on that
port will turn into our VLAN 200, and tagged traffic will turn into our dot1ad stack of
outer VLAN 200 and inner VLAN whatever the customer provided -- in our case allowing only
VLANs 1000-2000 and untagged traffic into VLAN 200:

```
fsw0#configure
fsw0(config)#vlan 200
fsw0(config-vlan)#name v-qinq-outer
fsw0(config-vlan)#exit
fsw0(config)#interface TenGigabitEthernet 0/3
fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
```

The industry remains conflicted about the outer ethernet frame's type -- originally a tag
protocol identifier (TPID) of 0x9100 was suggested, and that's what this switch uses by
default. But the first standardized specification of Q-in-Q, called 802.1ad, says that the
TPID should be 0x88a8, distinct from the 0x8100 used for regular 802.1q VLAN tags. This ugly
reality can be reflected directly in the switchport configuration by adding a `frame-tag tpid
0xXXXX` value to let the switch know which TPID to use for the outer tag.
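
To make the TPID difference concrete, here is a sketch (pure Python, my own illustration; the helper name is made up) of the eight tag bytes that follow the source MAC in the same double-tagged frame under each convention:

```python
import struct

def qinq_tags(outer_vid: int, inner_vid: int, outer_tpid: int = 0x9100) -> bytes:
    """Build the 8 bytes of outer+inner VLAN tags that follow the source MAC.
    Each tag is a 2-byte TPID followed by a 2-byte TCI (PCP/DEI zero here,
    12-bit VLAN ID in the low bits)."""
    return struct.pack("!HHHH", outer_tpid, outer_vid & 0xFFF,
                       0x8100, inner_vid & 0xFFF)

# This switch's default outer TPID (legacy 0x9100):
print(qinq_tags(200, 1000).hex())            # 910000c8810003e8
# The 802.1ad standard outer TPID (0x88a8):
print(qinq_tags(200, 1000, 0x88a8).hex())    # 88a800c8810003e8
```

Only the first two bytes differ; the rest of the frame is identical, which is why the mismatch is so easy to paper over with a per-port TPID knob.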

If this type of historical thing interests you, I definitely recommend reading up on
[802.1q](https://en.wikipedia.org/wiki/IEEE_802.1Q) and [802.1ad](https://en.wikipedia.org/wiki/IEEE_802.1ad)
on Wikipedia as well.

## Loadtests

For my loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
with a custom Python controller that ramps traffic from the loadtester to the device under test
(DUT) up and down, by sending traffic out `port0` to the DUT and expecting that traffic to be
presented back from the DUT on its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). You can read a bit more about my setup in my [Loadtesting at Coloclue]({% post_url 2021-02-27-coloclue-loadtest %})
post.

To stress test the switch, several port pairs at 10G and 25G were used, and since the specs
boast line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found
out, once again, that Intel's X710 network cards aren't line rate, something I'll dive into in
a bit more detail another day; for now, take a look at the [T-Rex docs](https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_stateless_bench.asciidoc).

### L2

First let's test a straightforward configuration. I connect a DAC between a 40G port on each
switch, and connect a loadtester to port `TenGigabitEthernet 0/1` and `TenGigabitEthernet 0/2`
on either switch, and leave everything simply in the default VLAN. This means packets from
Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and
out on Te0/1 and Te0/2 there, to return to the loadtester. Configuration wise, rather boring:

```
fsw0#configure
fsw0(config)#vlan 1
fsw0(config-vlan)#name v-default

fsw0#show run int te0/1
interface TenGigabitEthernet 0/1

fsw0#show run int te0/2
interface TenGigabitEthernet 0/2

fsw0#show run int fo0/26
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1

fsw0#show vlan id 1
VLAN  Name                              Status    Ports
----- --------------------------------- --------- -----------------------------------
1     v-default                         STATIC    Te0/1, Te0/2, Te0/3, Te0/4
                                                  Te0/5, Te0/6, Te0/7, Te0/8
                                                  Te0/9, Te0/10, Te0/11, Te0/12
                                                  Te0/13, Te0/14, Te0/15, Te0/16
                                                  Te0/17, Te0/18, Te0/19, Te0/20
                                                  TF0/21, TF0/22, TF0/23, Fo0/25
                                                  Fo0/26
```

I set up T-Rex with unique MAC addresses for each of its ports. I find it useful
to codify a few bits of information into the MAC, such as loadtester machine,
PCI bus, and port, so that when I try to find them in the switches' forwarding
tables while many loadtesters are running at the same time, it's easier to
find what I'm looking for. My T-Rex configuration for this loadtest:
```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version      : 2
  interfaces   : ["42:00.0","42:00.1", "42:00.2", "42:00.3"]
  port_limit   : 4
  port_info    :
    - dest_mac : [0x0,0x2,0x1,0x1,0x0,0x00] # port 0
      src_mac  : [0x0,0x2,0x1,0x2,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x2,0x0,0x00] # port 1
      src_mac  : [0x0,0x2,0x1,0x1,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x3,0x0,0x00] # port 2
      src_mac  : [0x0,0x2,0x1,0x4,0x0,0x00]
    - dest_mac : [0x0,0x2,0x1,0x4,0x0,0x00] # port 3
      src_mac  : [0x0,0x2,0x1,0x3,0x0,0x00]
```

Here's where I notice something I've seen before: the Intel X710 network cards cannot
actually fill 4x10G at line rate. They're fine at larger frames, but they max out at about
32Mpps of throughput -- and we know that each 10G connection filled with small ethernet frames
in one direction will consume 14.88Mpps. The same is true for the XXV710 cards: the chip
used will really only source around 30Mpps across all ports, which is sad but true.

So I have a choice to make: either I run small packets at a rate that's acceptable for the
NIC (~7.5Mpps per port, thus 30Mpps across the X710-DA4), or I run `imix` at line rate
but with slightly fewer packets/sec. I chose the latter for these tests, and will be reporting
the usage based on the `imix` profile, which saturates 10G at 3.28Mpps in one direction, or
13.12Mpps per network card.
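
That ~3.28Mpps figure falls out of the average imix frame size; a quick sketch, assuming the classic "simple imix" mix of 7x64B, 4x594B and 1x1518B frames (I believe this is what T-Rex's profile uses, but treat the exact mix as an assumption):

```python
# The classic "simple imix": 7x 64B, 4x 594B, 1x 1518B frames. The 20B per
# frame covers preamble, SFD and inter-frame gap on the wire.
MIX = [(7, 64), (4, 594), (1, 1518)]
OVERHEAD = 20

frames = sum(n for n, _ in MIX)
avg_wire_bits = sum(n * (size + OVERHEAD) * 8 for n, size in MIX) / frames

pps_10g = 10e9 / avg_wire_bits
print(f"{pps_10g / 1e6:.2f} Mpps saturates one direction of a 10G port")  # ~3.27 Mpps
```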

Of course, I can run two of these at the same time, *pourquoi pas*, which looks like this:

```
fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
1          0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 00:16:11
1          0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:16:11
1          0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 00:16:10
1          0002.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0102.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:15:51
1          0002.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:15:51
1          0002.0104.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:15:50

fsw0#show int usage | exclude 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/2               10000 Mbit  94.66%           94.66%           94.66%
TenGigabitEthernet 0/3               10000 Mbit  94.65%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  94.66%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.66%           94.66%           94.66%

fsw0#show cpu core
[Slot 0 : S5860-20SQ]
   Core   5Sec   1Min   5Min
      0 16.40% 12.00% 12.80%
```

This is the first time I noticed that the switch usage (94.66%) somewhat confusingly
lines up with the observed T-Rex statistics: what the switch reports is what T-Rex considers
L2 (ethernet) use, not L1 use. For an in-depth explanation of this, see the L3 section
below. But for now, let's just say that when T-Rex says it's sending 37.9Gbps of ethernet
traffic (which is 40.00Gbps of bits on the wire), that corresponds to the ~94.7% we see
the switch reporting.

Suffice to say, at 80Gbit of actual throughput (40G ingressing on Te0/1-4 and 40G egressing
on Te0/1-4), the switch performs at line rate, with no noticeable lag or jitter.
The CLI stays responsive and the fans aren't spinning any harder than at idle, even after
60min of packets. Good!

### QinQ

Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) drop
into its own Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration:

```
interface TenGigabitEthernet 0/1
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 20
 switchport dot1q-tunnel native vlan 20
!
interface TenGigabitEthernet 0/3
 switchport mode dot1q-tunnel
 switchport dot1q-tunnel allowed vlan add untagged 21
 switchport dot1q-tunnel native vlan 21
 spanning-tree bpdufilter enable
!
interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20-21

fsw0#show mac
Vlan       MAC Address          Type     Interface                      Live Time
---------- -------------------- -------- ------------------------------ -------------
20         0001.0101.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0102.0000       DYNAMIC  TenGigabitEthernet 0/1         0d 01:15:01
20         0001.0103.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 01:15:02
20         0001.0104.0000       DYNAMIC  TenGigabitEthernet 0/2         0d 01:15:03
21         0002.0101.0000       DYNAMIC  TenGigabitEthernet 0/4         0d 00:01:50
21         0002.0102.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:03
21         0002.0103.0000       DYNAMIC  TenGigabitEthernet 0/3         0d 00:01:59
21         0002.0104.0000       DYNAMIC  FortyGigabitEthernet 0/26      0d 00:01:02
```

Two things happen here that require a bit of explanation. First of all, despite both
loadtesters using the exact same configuration (in fact, I didn't even stop them from
emitting packets while reconfiguring the switch), I now have packet loss: the throughput
per 10G port has dropped from 94.67% to 93.63%, and at the same time, I observe that the
40G ports raised their usage from 94.66% to 94.81%.

```
fsw1#show int usage | e 0.00
Interface                            Bandwidth   Average Usage    Output Usage     Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1               10000 Mbit  94.20%           93.63%           94.67%
TenGigabitEthernet 0/2               10000 Mbit  94.21%           93.65%           94.67%
TenGigabitEthernet 0/3               10000 Mbit  91.05%           94.66%           94.66%
TenGigabitEthernet 0/4               10000 Mbit  90.80%           94.66%           94.66%
FortyGigabitEthernet 0/26            40000 Mbit  94.81%           94.81%           94.81%
```

The switches, however, are perfectly fine. The reason for the loss is that when I created
the `dot1q-tunnel`, the switch sticks another VLAN tag (4 bytes, or 32 bits) on each packet
before sending it out the 40G port between the switches, and at these packet rates, that adds
up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps), which, when the
switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps
on top of the 40G line rate, implying we're going to lose roughly 1.05% of our packets.
And indeed, the difference between 94.67% (inbound) and 93.63% (outbound) is 1.04%, which
lines up nicely.
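
The back-of-the-envelope above can be checked in a few lines (my own sketch):

```python
# Each dot1q-tunnel packet grows by one 4-byte (32-bit) VLAN tag on the trunk.
PPS_PER_PORT = 3.28e6   # imix at ~100% of 10G, one direction
PORTS = 4               # Te0/1-4 feeding the Fo0/26 trunk
TAG_BITS = 32
TRUNK_BPS = 40e9

total_pps = PPS_PER_PORT * PORTS            # 13.12 Mpps
overhead_bps = total_pps * TAG_BITS         # extra bits added by the outer tag
loss_pct = overhead_bps / TRUNK_BPS * 100   # share that no longer fits in 40G

print(f"overhead: {overhead_bps / 1e6:.1f} Mbps, expected loss: {loss_pct:.2f}%")
# -> overhead: 419.8 Mbps, expected loss: 1.05%
```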

```
Global Statistics

connection   : localhost, Port 4501                  total_tx_L2  : 37.92 Gbps
version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
cpu_util.    : 43.52% @ 8 cores (4 per dual port)    total_rx     : 37.92 Gbps
rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.12 Mpps
async_util.  : 0% / 198.64 bps                       drop_rate    : 0 bps
total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

   port    |         0         |         1         |         2         |         3
-----------+-------------------+-------------------+-------------------+------------------
owner      |               pim |               pim |               pim |               pim
link       |                UP |                UP |                UP |                UP
state      |      TRANSMITTING |      TRANSMITTING |      TRANSMITTING |      TRANSMITTING
speed      |           10 Gb/s |           10 Gb/s |           10 Gb/s |           10 Gb/s
CPU util.  |            46.29% |            46.29% |            40.76% |            40.76%
--         |                   |                   |                   |
Tx bps L2  |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps
Tx bps L1  |           10 Gbps |           10 Gbps |           10 Gbps |           10 Gbps
Tx pps     |         3.28 Mpps |         3.27 Mpps |         3.27 Mpps |         3.28 Mpps
Line Util. |          100.04 % |          100.04 % |          100.04 % |          100.04 %
---        |                   |                   |                   |
Rx bps     |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps |         9.48 Gbps
Rx pps     |         3.24 Mpps |         3.24 Mpps |         3.23 Mpps |         3.24 Mpps
----       |                   |                   |                   |
opackets   |        1891576526 |        1891577716 |        1891547042 |        1891548090
ipackets   |        1891576643 |        1891577837 |        1891547158 |        1891548214
obytes     |      684435443496 |      684435873418 |      684424773684 |      684425153614
ibytes     |      684435484082 |      684435916902 |      684424817178 |      684425197948
tx-pkts    |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts
rx-pkts    |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts |        1.89 Gpkts
tx-bytes   |         684.44 GB |         684.44 GB |         684.42 GB |         684.43 GB
rx-bytes   |         684.44 GB |         684.44 GB |         684.42 GB |         684.43 GB
-----      |                   |                   |                   |
oerrors    |                 0 |                 0 |                 0 |                 0
ierrors    |                 0 |                 0 |                 0 |                 0
```

### L3

For this test, I reconfigured the 25G ports to be routed rather than switched, and I put them
under 80% load with T-Rex (where 80% is of L1), so those ports are emitting 20Gbps of traffic
at a rate of 13.12Mpps. I left two of the 10G ports continuing their ethernet loadtest at
100%, which is also 20Gbps of traffic and 13.12Mpps. In total, I observed 79.95Gbps of traffic
between the two switches: an entirely saturated 40G port in both directions.
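
A quick sanity check on those packet rates, under the same simple-imix assumption as earlier (my own sketch, with the 10G ports taken at full load):

```python
# Average wire bits per frame for the classic "simple imix" (assumed to be
# the mix behind T-Rex's imix profile), incl. 20B preamble/IFG per frame.
AVG_WIRE_BITS = (7 * (64 + 20) + 4 * (594 + 20) + 1 * (1518 + 20)) * 8 / 12

def pps(line_gbps: float, load: float) -> float:
    """Packets/sec when driving `load` fraction of an Ethernet line rate."""
    return line_gbps * 1e9 * load / AVG_WIRE_BITS

print(f"25G @ 80%:  {pps(25, 0.80) / 1e6:.2f} Mpps")        # ~6.55 Mpps
print(f"10G @ 100%: {pps(10, 1.00) / 1e6:.2f} Mpps")        # ~3.27 Mpps
trunk = pps(25, 0.80) + 2 * pps(10, 1.00)                   # Tf0/24 + Te0/1 + Te0/2
print(f"40G trunk:  {trunk / 1e6:.1f} Mpps per direction")  # ~13.1 Mpps
```

The ~6.55Mpps and ~13.1Mpps figures are exactly what the `show int` counters further down report for `Tf0/24` and `Fo0/26`.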

I then created a simple topology with OSPF: both switches got a `Loopback0` interface with
a /32 IPv4 and /128 IPv6 address, and a transit network between them in a `VLAN100` interface.
OSPF and OSPFv3 both redistribute connected and static routes, to keep things simple.

Finally, I added an IP address on the `Tf0/24` interface, set a static IPv4 route for 16.0.0.0/8
and 48.0.0.0/8 towards that interface on each switch respectively, and added VLAN 100 to the
`Fo0/26` trunk. It looks like this for switch `fsw0`:

```
interface Loopback 0
 ip address 100.64.0.0 255.255.255.255
 ipv6 address 2001:DB8::/128
 ipv6 enable

interface VLAN 100
 ip address 100.65.2.1 255.255.255.252
 ipv6 enable
 ip ospf network point-to-point
 ipv6 ospf network point-to-point
 ipv6 ospf 1 area 0

interface TFGigabitEthernet 0/24
 no switchport
 ip address 100.65.1.1 255.255.255.0
 ipv6 address 2001:DB8:1:1::1/64

interface FortyGigabitEthernet 0/26
 switchport mode trunk
 switchport trunk allowed vlan only 1,20,21,100

router ospf 1
 graceful-restart
 redistribute connected subnets
 redistribute static subnets
 area 0
 network 100.65.2.0 0.0.0.3 area 0
!

ipv6 router ospf 1
 graceful-restart
 redistribute connected
 redistribute static
 area 0
!

ip route 16.0.0.0 255.0.0.0 100.65.1.2
ipv6 route 2001:db8:100::/40 2001:db8:1:1::2
```

With this topology, an L3 routing domain emerges between `Tf0/24` on switch `fsw0` and `Tf0/24`
on switch `fsw1`, and we can inspect this. Taking a look at `fsw1`, I can see that both IPv4 and
IPv6 adjacencies have formed, and that the switches, née routers, have learned
routes from one another:

```
fsw1#show ip ospf neighbor
 OSPF process 1, 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State        BFD State   Dead Time   Address         Interface
100.65.2.1      1     Full/ -      -           00:00:31    100.65.2.1      VLAN 100

fsw1#show ipv6 ospf neighbor
 OSPFv3 Process (1), 1 Neighbors, 1 is Full:
Neighbor ID     Pri   State        BFD State   Dead Time   Instance ID   Interface
100.65.2.1      1     Full/ -      -           00:00:31    0             VLAN 100

fsw1#show ip route
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, A - Arp to host
       * - candidate default

Gateway of last resort is no set
O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100
S    48.0.0.0/8 [1/0] via 100.65.0.2
O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100
C    100.64.0.1/32 is local host.
C    100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24
C    100.65.0.1/32 is local host.
O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100
C    100.65.2.0/30 is directly connected, VLAN 100
C    100.65.2.2/32 is local host.

fsw1#show ipv6 route
IPv6 routing table name - Default - 12 entries
Codes: C - Connected, L - Local, S - Static
       R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       IA - Inter area, EV - BGP EVPN, N - Nd to host

O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
LC   2001:DB8::1/128 via Loopback 0, local host
C    2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected
L    2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host
O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
C    FE80::/10 via ::1, Null0
C    FE80::/64 via Loopback 0, directly connected
L    FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host
C    FE80::/64 via TFGigabitEthernet 0/24, directly connected
L    FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host
C    FE80::/64 via VLAN 100, directly connected
L    FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host
```

Great success! I can see from the `fsw1` output above that its OSPF process has learned
routes for the IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::/128 respectively),
the connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and the
static routes (16.0.0.0/8 and 2001:DB8:100::/40).

So let's make use of this topology and change one of the two loadtesters to L3 mode
instead:

```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version           : 2
  interfaces        : ["0e:00.0", "0e:00.1" ]
  port_bandwidth_gb : 25
  port_limit        : 2
  port_info         :
    - ip            : 100.65.0.2
      default_gw    : 100.65.0.1
    - ip            : 100.65.1.2
      default_gw    : 100.65.1.1
```

I left the loadtest running for 12hrs or so, and observed the results to be squeaky clean.
The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating
40.00Gbit of traffic at 25.98Mpps (remember, this was with the load set to 80% on the 25G
port, and 99% on the 10G ports). Looking at the switch and again being surprised by the
discrepancy, I decided to fully explore the curiosity in this switch's utilization reporting.

```
fsw1#show interfaces usage | exclude 0.00
Interface                            Bandwidth   Average Usage  Output Usage   Input Usage
------------------------------------ ----------- -------------- -------------- -----------
TenGigabitEthernet 0/1               10000 Mbit  93.80%         93.79%         93.81%
TenGigabitEthernet 0/2               10000 Mbit  93.80%         93.79%         93.81%
TFGigabitEthernet 0/24               25000 Mbit  75.80%         75.79%         75.81%
FortyGigabitEthernet 0/26            40000 Mbit  94.79%         94.79%         94.79%

fsw1#show int te0/1 | inc packets/sec
  10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec
  10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec

fsw1#show int tf0/24 | inc packets/sec
  10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec
  10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec

fsw1#show int fo0/26 | inc packets/sec
  10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec
  10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec
```
|
||||
|
||||
Looking at that number, 75.80% was not the 80% that I had asked for, and the usage of
the 10G ports (which I put at 99% load) and the 40G port is also lower than I had
anticipated. What's going on there? It's quite simple after doing some math: ***the
switch is reporting L2 bits/sec, not L1 bits/sec!***

On the L3 loadtest, using the `imix` profile, T-Rex is sending 13.02Mpps of load, which,
according to its own observation, is 37.8Gbit of L2 and 40.00Gbps of L1 bandwidth. On the
L2 loadtest, again using the `imix` profile, T-Rex is sending 4x 3.24Mpps, which it claims
is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put the loadtester here at 99% of
line rate, to ensure I would not end up losing packets due to congestion on the 40G port).

So according to T-Rex, I am sending 75.4Gbps of traffic (37.8Gbps in the L3 test and
37.6Gbps in the simultaneous L2 loadtest), yet I'm seeing 37.9Gbps on the switchport.
Oh my!

Here's how all of these numbers relate to one another:

* First off, we are sending 99% linerate at 3.24Mpps into Te0/1 and Te0/2 on each switch.
* Then, we are sending 80% linerate at 6.55Mpps into Tf0/24 on each switch.
* Te0/1 and Te0/2 are both in the default VLAN on either side.
* But Tf0/24 sends its IP traffic through the VLAN 100 interconnect, which means all of
  that traffic gets a dot1q VLAN tag added. That's 4 bytes for each packet.
* Sending 6.55Mpps * 32 bits extra equals 209600000 bits/sec (0.21Gbps).
* The loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the
  difference we calculated above (0.21Gbps): the overhead created by adding the VLAN tag
  to the 25G stream in VLAN 100.

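That dot1q bookkeeping is easy to double-check. Here's a quick back-of-the-envelope
sketch in plain Python, using the packet and bit rates reported above:

```python
# Back-of-the-envelope check: the 25G stream crosses the VLAN 100 trunk,
# so every packet grows by a 4-byte (32-bit) dot1q tag.
pps = 6.55e6        # packets/sec on Tf0/24, as reported by the switch
tag_bits = 4 * 8    # one 802.1Q tag per packet

overhead_bps = pps * tag_bits
print(f"dot1q overhead: {overhead_bps / 1e9:.2f} Gbps")   # 0.21 Gbps

loadtester_gbps = 37.70   # what T-Rex reports (untagged frames)
switch_gbps = loadtester_gbps + overhead_bps / 1e9
print(f"switch should see: {switch_gbps:.2f} Gbps")       # 37.91 Gbps
```

Which lands exactly on the 37.91Gbps the switch counters show.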
Now we are ready to explain the difference between the switch-reported port usage and
the loadtester-reported port usage:

* The loadtester is sending an `imix` traffic mix, which consists of a 28:16:4 ratio of
  packets that are 64, 590 and 1514 bytes long.
* We already know that to create a packet on the wire, we have to add a 7 byte preamble
  and a 1 byte start frame delimiter, and end with a 12 byte interpacket gap, so each
  ethernet frame is 20 bytes longer, making 84 bytes the smallest possible on-the-wire
  frame.
* We know we're sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage:
  * Each packet needs 20 bytes or 160 bits of overhead, which is 518400000 bits/sec.
  * We are seeing 9381044793 bits/sec on the 10G port (**corresponding to the switch's 93.80% usage**).
  * Adding these two numbers up gives us 9899444793 bits/sec (**corresponding to T-Rex's 98.99% usage**).
* Conversely, the whole system is sending 37.9Gbps on the 40G port (**corresponding to the switch's 37.9/40 == 94.79% usage**):
  * We know this is 2x 10G streams at 99% utilization and 1x 25G stream at 80% utilization.
  * This is 13.03Mpps, which generates 2084800000 bits/sec of overhead.
  * Adding these two numbers up gives us 40.00Gbps of usage (the expected L1 line rate).

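The same reconciliation can be sketched in a few lines of Python. The `imix` packet
sizes and the bit/packet counters are taken from the output above; the rest is just
the 20-byte-per-frame Ethernet overhead arithmetic:

```python
# Reconcile switch-reported (L2) and T-Rex-reported (L1) usage on a 10G port.
# imix: 28:16:4 ratio of 64, 590 and 1514 byte frames.
ratio = {64: 28, 590: 16, 1514: 4}
pkts = sum(ratio.values())
avg_l2 = sum(size * n for size, n in ratio.items()) / pkts   # ~360 bytes on average
avg_l1 = avg_l2 + 20   # + preamble(7) + SFD(1) + interpacket gap(12)

# At 99% of 10G line rate, how many packets/sec should we expect?
pps = 0.99 * 10e9 / (avg_l1 * 8)
print(f"expected pps: {pps / 1e6:.2f} Mpps")   # ~3.26 Mpps, close to the 3.24 Mpps observed

# Going the other way: L2 counter + per-packet overhead == L1 rate.
l2_bps = 9381044793                  # switch counter (93.80% of 10G)
l1_bps = l2_bps + 3.24e6 * 160       # 20 bytes == 160 bits per packet
print(f"L1 rate: {l1_bps / 1e9:.2f} Gbps")     # 9.90 Gbps, i.e. T-Rex's 98.99%
```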
I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way,
the switches are switching and routing all of this with 0.00% packet loss, and the
chassis doesn't even get warm :-)

```
Global Statistics

 connection   : localhost, Port 4501                  total_tx_L2  : 38.02 Gbps
 version      : STL @ v2.91                           total_tx_L1  : 40.02 Gbps
 cpu_util.    : 21.88% @ 4 cores (4 per dual port)    total_rx     : 38.02 Gbps
 rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 13.13 Mpps
 async_util.  : 0% / 39.16 bps                        drop_rate    : 0 bps
 total_cps.   : 0 cps                                 queue_full   : 0 pkts

Port Statistics

   port    |         0         |         1         |       total
-----------+-------------------+-------------------+------------------
owner      |        pim        |        pim        |
link       |        UP         |        UP         |
state      |   TRANSMITTING    |   TRANSMITTING    |
speed      |      25 Gb/s      |      25 Gb/s      |
CPU util.  |      21.88%       |      21.88%       |
--         |                   |                   |
Tx bps L2  |    19.01 Gbps     |    19.01 Gbps     |    38.02 Gbps
Tx bps L1  |    20.06 Gbps     |    20.06 Gbps     |    40.12 Gbps
Tx pps     |     6.57 Mpps     |     6.57 Mpps     |    13.13 Mpps
Line Util. |      80.23 %      |      80.23 %      |
---        |                   |                   |
Rx bps     |      19 Gbps      |      19 Gbps      |    38.01 Gbps
Rx pps     |     6.56 Mpps     |     6.56 Mpps     |    13.13 Mpps
----       |                   |                   |
opackets   |   292215661081    |   292215652102    |   584431313183
ipackets   |   292152912155    |   292153677482    |   584306589637
obytes     |  105733412810506  |  105733412001676  |  211466824812182
ibytes     |  105710857873526  |  105711223651650  |  211422081525176
tx-pkts    |   292.22 Gpkts    |   292.22 Gpkts    |   584.43 Gpkts
rx-pkts    |   292.15 Gpkts    |   292.15 Gpkts    |   584.31 Gpkts
tx-bytes   |     105.73 TB     |     105.73 TB     |     211.47 TB
rx-bytes   |     105.71 TB     |     105.71 TB     |     211.42 TB
-----      |                   |                   |
oerrors    |         0         |         0         |         0
ierrors    |         0         |         0         |         0
```

### Conclusions

It's just super cool to see a switch like this work as expected. I did not manage to
overload it at all, neither with an IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor
with an L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in
hardware, as are IPv4 route lookups. I will be putting these switches into production
soon on the IPng Networks links between Glattbrugg and Rümlang in Zurich, thereby
upgrading our backbone from 10G to 25G CWDM. It seems to me that using these switches as
L3 devices in a smaller OSPF routing domain (currently, we have ~300 prefixes in our
OSPF at AS50869) would definitely work well, as would pushing and popping QinQ trunks
for our customers (for example on Solnet or Init7 or Openfactory).

Approved. A+, will buy again.