---
date: "2021-11-26T08:51:23Z"
title: 'Review: Netgate 6100'
---

* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
* Status: Draft - Review - **Approved**

A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({% post_url 2021-07-19-pcengines-apu6 %})
and introduced me to their new desktop router/firewall, the Netgate 6100. It currently ships with
[pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's also designed
to run their [TNSR](https://www.netgate.com/tnsr) software, considering the device ships
with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces
are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with
VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet
public for this device, but that's not a problem because [AS8298 runs VPP]({% post_url 2021-09-10-vpp-6 %}),
so I'll just go ahead and install it myself ...

# Executive Summary

The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis
throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps),
about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields
very similar results.

With VPP though, the router's single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This
means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction
on 10Gbit. Due to its Atom C3558 processor (4 cores, of which 3 are dedicated to VPP's worker threads
and 1 to the main thread and controlplane running in Linux), achieving 10Gbit line rate in both
directions with 64 byte packets is not possible.

Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26µJ_ of
energy per forwarded packet** (19W / 15.1Mpps ≈ 1.26µJ), while at the same time easily handling a
full BGP table with room to spare. I find this Netgate 6100 appliance pretty impressive, and when
TNSR becomes available, performance will be similar to what I've tested here, at a price tag of USD 699.

## Detailed findings

{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}

The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an
Intel Atom C3558 CPU (4 cores, including AES-NI and QuickAssist), 8GB of memory and either 16GB of
eMMC, or 128GB of NVMe storage. The network cards are its main forte: it comes with 2x i354
gigabit combo ports (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+
cage each, for a total of 8 ports and lots of connectivity.

The machine is fanless, made possible by its power efficient CPU: the Atom here runs at only 16W
TDP, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick,
but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are
a few nice touches that I noticed:
* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
* There's both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector,
which means you can connect to its serial port with a standard issue micro-USB cable as well. Cool!

### Battle of Operating Systems

Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering
firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial
to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So
I'll have to deface this little guy, and reinstall it with Linux. My game plan is:

1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
1. Install VPP using my own [HOWTO]({% post_url 2021-09-21-vpp-7 %}), and do all the loadtests

This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a
description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy
as I had anticipated (but is it ever easy, really?)

{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}}

Turning on the device, it presents me with BIOS firmware from Insyde Software, which
loads some software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which
in turn loads modules called _Lenses_, pictured right. Anyway, this ultimately presents
me with a ***Press F2 for Boot Options***. Aha! That's exactly what I'm looking for. I'm really
grateful that Netgate decided to ship a device with a BIOS that allows me to boot off of other
media, notably a USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html),
but in my case, also to install another operating system entirely.

My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the
side), but none of the USB ports would load my UEFI-prepared `bootx64.efi` USB key. So my second
attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]:

```
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso
mv mini.iso /tmp/mini-focal.iso
grub-mkimage --format=x86_64-efi \
  --output=/var/tftpboot/grubnetx64.efi.signed \
  --memdisk=/tmp/mini-focal.iso \
  `ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'`
```

After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time
to the stone age, I see the PXE client request an IPv4 address as well as the image I prepared. And, it boots, yippie!

```
Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed'
```

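For completeness, the DHCP/TFTP side of this was along the following lines -- a minimal isc-dhcp-server
sketch, where the subnet, addresses and TFTP root are assumptions chosen to match the log above, not my
exact config:

```
# /etc/dhcp/dhcpd.conf (excerpt): hand UEFI PXE clients the grub image
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.200 192.168.1.220;
  option routers 192.168.1.254;
  next-server 192.168.1.254;          # the TFTP server
  filename "grubnetx64.efi.signed";   # the image built with grub-mkimage above
}

# ... and a TFTP daemon serving /var/tftpboot, for example:
# in.tftpd --listen --secure /var/tftpboot
```
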
I took a peek while `grubnetx64` was booting, and saw that the available output terminals
on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that
the default/active one is `console`. So I make a note that Grub wants to run on 'console' (and
specifically NOT on 'serial', as is usual -- see below for a few more details on this), while the Linux
kernel will of course be running on serial, so I have to add `console=ttyS0,115200n8` to the kernel boot
string before booting.

Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get
it quite right -- pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the
`mini.iso` install:

```
mount --bind /proc /target/proc
mount --bind /dev /target/dev
mount --bind /sys /target/sys
chroot /target /bin/bash

# Install OpenSSH, otherwise the machine boots w/o access :)
apt update
apt install openssh-server

# Fix serial for GRUB and Kernel
vi /etc/default/grub
## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
## set GRUB_TERMINAL=console (and comment out the serial stuff)
grub-install /dev/sda
update-grub
```

Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you've still got it!

## Network Loadtest

After that small but exciting detour, let me get back to the loadtesting. The choice of Intel
network controllers on this board allows me to use DPDK with relatively high
performance, compared to regular (kernel based) routing. I loadtested the stock pfSense firmware
(21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]).

Specifically worth calling out: while Linux and FreeBSD struggled in the packets-per-second
department, the use of DPDK in VPP meant absolutely no problem filling a unidirectional 10G stream
of "regular internet traffic" (referred to as `imix`). It was also able to fill _line rate_ with
64b UDP packets, with just a little headroom to spare, but it ultimately struggled with _bidirectional_
64b UDP packets.

### Methodology

For the loadtests, I used Cisco's T-Rex [[ref](https://trex-tgn.cisco.com/)] in stateless mode,
with a custom Python controller that ramps traffic up and down from the loadtester to the _device
under test_ (DUT), by sending traffic out of `port0` to the DUT and expecting that traffic to be
presented back at its `port1`, and vice versa (out of `port1` -> DUT -> back in on `port0`). The
loadtester first sends a few seconds of warmup traffic, to ensure the DUT is passing traffic at all
and to offer a chance to inspect the traffic before the actual rampup. Then the loadtester ramps up
linearly from zero to 100% of line rate (in this case, line rate is 10Gbps in both directions), and
finally it holds the traffic at full line rate for a certain duration. If at any point the loadtester
fails to see the traffic it's emitting return on its other port, it flags the DUT as saturated, and
this is noted as the maximum bits/second and/or packets/second.

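The `trex-loadtest.py` script itself is linked in the appendix; the core of such a ramp-up against
TRex's stateless API looks roughly like the sketch below. The profile name, step size and hold time
are illustrative assumptions, not the exact parameters of my script, which also records per-step
statistics to JSON for the graphs further down.

```
#!/usr/bin/env python3
# Minimal sketch of a linear ramp-up using TRex's stateless API.
from trex_stl_lib.api import STLClient, STLProfile

def rampup(profile="stl/bench.py", warmup="1%", steps=20, hold=30):
    c = STLClient()                       # TRex server on localhost by default
    c.connect()
    try:
        c.reset(ports=[0, 1])
        streams = STLProfile.load_py(profile).get_streams()
        c.add_streams(streams, ports=[0, 1])

        # Warmup: make sure the DUT forwards traffic at all before ramping up.
        c.start(ports=[0, 1], mult=warmup, duration=5)
        c.wait_on_traffic(ports=[0, 1])

        # Ramp linearly towards 100% of line rate, watching for loss.
        for pct in range(5, 101, 100 // steps):
            c.clear_stats()
            c.start(ports=[0, 1], mult="%d%%" % pct, duration=10)
            c.wait_on_traffic(ports=[0, 1])
            stats = c.get_stats()
            tx = stats[0]["opackets"] + stats[1]["opackets"]
            rx = stats[0]["ipackets"] + stats[1]["ipackets"]
            if rx < 0.999 * tx:           # DUT drops packets: call it saturated
                print("DUT saturated at %d%% of line rate" % pct)
                break
        else:
            # Never saturated: hold at full line rate for a while.
            c.start(ports=[0, 1], mult="100%", duration=hold)
            c.wait_on_traffic(ports=[0, 1])
    finally:
        c.disconnect()

if __name__ == "__main__":
    rampup()
```
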
Since my last loadtesting [post]({% post_url 2021-07-19-pcengines-apu6 %}), I've learned a lot
more about packet forwarding and how to make it easier or harder on the router. Let me go into a
few more details about the various loadtests that I've done here.

#### Method 1: Single CPU Thread Saturation

Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues
if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side
Scaling_ (RSS), which means the NIC can distribute its packets over multiple queues. The kernel
will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are
initialized, and inbound traffic is spread, typically using some hashing function, over individual
CPUs, allowing for a higher aggregate throughput.

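On Linux, the queue count and RSS setup of these NICs can be inspected and changed with `ethtool`;
a quick sketch (interface names as used later in this post, and not every driver exposes all of these):

```
ethtool -l enp3s0f0                     # how many combined Rx/Tx queues the NIC offers/uses
ethtool -L enp3s0f0 combined 4          # use four queues, one per Atom core
ethtool -x enp3s0f0                     # show the RSS hash key and indirection table
ethtool -n enp3s0f0 rx-flow-hash udp4   # which header fields are hashed for UDP over IPv4
```
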
Mostly, this hashing function is computed over some L3 or L4 payload, for example a hash over
the source IP/port and destination IP/port. So one interesting test is to send **the same packet**
over and over again -- the hash function will then return the same value for each packet, which
means all traffic goes into exactly one of the N available queues, and is therefore handled by
only one core.

One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies,
simply sends the same UDP packet, with a fixed source and destination IP/port and padded with
a bunch of 'x' characters, over and over again:

```
# As in the stock TRex profiles; Ether/IP/UDP are Scapy classes re-exported here.
from trex_stl_lib.api import *

packet = STLPktBuilder(pkt =
    Ether() /
    IP(src="16.0.0.1", dst="48.0.0.1") /
    UDP(dport=12, sport=1025) / (10 * 'x')
)
```

#### Method 2: Rampup using trex-loadtest.py

TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional
arguments, does the same thing as the method above. However, this profile optionally takes a few
arguments, called _tunables_ (see the console example below), notably:
* ***size*** - sets the size of the packets: either a number (e.g. 64, the default, or any number
up to a maximum of 9216 bytes), or the string `imix`, which will send a traffic mix consisting of
60b, 590b and 1514b packets.
* ***vm*** - sets the packet source/dest generation. By default (when the flag is `None`), the
same src (16.0.0.1) and dst (48.0.0.1) are set for each packet. When setting the value to
`var1`, the source IP is incremented through `16.0.0.[4-254]`. If the value is set to
`var2`, the source _and_ destination IP are incremented, the destination through `48.0.0.[4-254]`.

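From the TRex interactive console, these tunables are passed with the `-t` flag; a quick sketch
(the same `start` syntax shows up in the outputs further down):

```
tui>start -f stl/bench.py -p 0 -m 100% -t size=imix,vm=var2    # imix, many flows, one direction
tui>start -f stl/bench.py -p 0 1 -m 100% -t size=64,vm=var2    # 64b, many flows, both directions
tui>start -f stl/bench.py -p 0 -m 100%                         # 64b, single flow (the defaults)
```
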
So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed
with this, I will perform loadtests in four modes of interest, from easier to more challenging:
1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform,
as the traffic consists of large (1514 byte) packets, and both source and destination are
different each time, which means lots of multiplexing across receive queues, and relatively few
packets/sec.
1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This
yields what can reasonably be expected from _normal internet use_, just about 3.2Mpps at
10Gbps. This is the most representative test for normal use, but the packet rate is still quite
low due to the (relatively) large packets. Any respectable router should be able to perform well at
an imix profile.
1. ***bench-var2-64b***: still multiple flows, but very small packets, 14.88Mpps at 10Gbps,
often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as
the loadtester will fill the line with small packets (of 64 bytes, the smallest an ethernet
packet is allowed to be). This is a good way to see if the router vendor is actually capable of
what is referred to as _line rate_ forwarding.
1. ***bench***: now restricted to a constant src/dst IP:port tuple, at the same rate of
14.88Mpps at 10Gbps, which means only one Rx queue (and thus, one CPU core) can be used. This is where
single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle
with a test like this, if any given CPU core cannot individually handle full line rate.
I'm looking at you, Tilera!

Further to this list, I can send traffic in one direction only (TRex will emit this from its
`port0` and expect the traffic to be seen back at `port1`), or I can send it ***in both directions***.
The latter doubles the packet rate and bandwidth, to approximately 29.7Mpps.

***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own
transmit queues, so I decided to drive it a bit less close to the cliff and stop at 97% of line rate
(this is 28.3Mpps). That explains why lots of these loadtests top out at that number.

### Results

#### Method 1: Single CPU Thread Saturation

Given the approaches above, for the first method I can "just" saturate the line and see how many packets
emerge through the DUT on the other port, so that's only 3 tests:

Netgate 6100  | Loadtest         | Throughput (pps) | L1 Throughput (bps)  | % of linerate
------------- | ---------------- | ---------------- | -------------------- | -------------
pfSense       | 64b 1-flow       | 622.98 Kpps      | 418.65 Mbps          | 4.19%
Linux         | 64b 1-flow       | 642.71 Kpps      | 431.90 Mbps          | 4.32%
***VPP***     | ***64b 1-flow*** | ***5.01 Mpps***  | ***3.37 Gbps***      | ***33.67%***

***NOTE***: The bandwidth figures here are so-called _L1 throughput_, which means bits on the wire, as opposed
to _L2 throughput_, which means bits in the ethernet frame. This is particularly relevant for 64b loadtests, as
the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-of-frame, and 12 bytes inter-frame gap
[[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also
means that when L1 bandwidth is fully consumed at 10Gbps, the observed L2 bandwidth will be only 7.62Gbps.

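To make that arithmetic concrete, here it is in a few lines of Python; nothing below is measured, it
just restates the frame-overhead math:

```
# L1 vs L2 throughput for 64 byte frames on 10GbE
frame    = 64          # bytes in the ethernet frame (L2, incl. header and FCS)
overhead = 7 + 1 + 12  # preamble + start-of-frame + inter-frame gap = 20 bytes
line     = 10e9        # bits/sec of L1 bandwidth

print(line / ((frame + overhead) * 8))  # ~14.88 Mpps: the line rate quoted above
print(frame / (frame + overhead))       # ~0.762: only 7.62 Gbps of L2 throughput
print(overhead / frame)                 # 0.3125: the 31.25% overhead
```
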
#### Interlude - VPP efficiency

In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons why it's so quick is that
VPP will consume an entire core, and grab ***a set of packets*** from the NIC rather than do work for each
individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first
of these packets will cause the code for the current graph node to be fetched into the CPU's instruction cache,
and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency.

I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and
observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface.
I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate.
Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1
with CPU cores), can be done with the `vppctl show runtime` command and some `awk` magic:

```
$ vppctl clear run && sleep 30 && vppctl show run | \
  awk '$2 ~ /active|polling/ && $4 > 25000 {
         print $0;
         if ($1=="ethernet-input") { packets = $4 };
         if ($1=="dpdk-input") { dpdk_time = $6 };
         total_time += $6
       } END {
         print packets/30, "packets/sec, at", total_time, "cycles/packet,",
           total_time-dpdk_time, "cycles/packet not counting DPDK"
       }'
```

This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen,
omitting the columns that are not super relevant. Ready? Here we go!

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     6.07e2      1.00
TenGigabitEthernet3/0/1-tx         8.61e2      1.00
dpdk-input                         1.51e6      0.00
ethernet-input                     1.22e3      1.00
ip4-input-no-checksum              6.59e2      1.00
ip4-load-balance                   4.50e2      1.00
ip4-lookup                         5.63e2      1.00
ip4-rewrite                        5.83e2      1.00
1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK
```

I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling
the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average
number of vectors per call is pretty much zero, and considering that, most of the CPU time goes into a nice "do
nothing". Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report
both the total cycles spent, that is to say including DPDK, as well as the cycles spent per packet in the
_other active_ nodes. In this case, at 1kpps, VPP is spending 4943 cycles on each packet.

Now, take a look at what happens when I raise the traffic to 1Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     3.80e1      18.57
TenGigabitEthernet3/0/1-tx         1.44e2      18.57
dpdk-input                         1.15e3      .39
ethernet-input                     1.39e2      18.57
ip4-input-no-checksum              8.26e1      18.57
ip4-load-balance                   5.85e1      18.57
ip4-lookup                         7.94e1      18.57
ip4-rewrite                        7.86e1      18.57
981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK
```

Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that
the CPU efficiency went up greatly, from 4943 cycles/packet at 1kpps to 620 cycles/packet at 1Mpps.
That's an order of magnitude improvement!

Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     1.41e1      256.00
TenGigabitEthernet3/0/1-tx         1.23e2      256.00
dpdk-input                         7.95e1      256.00
ethernet-input                     6.74e1      256.00
ip4-input-no-checksum              3.95e1      256.00
ip4-load-balance                   2.54e1      256.00
ip4-lookup                         4.12e1      256.00
ip4-rewrite                        4.78e1      256.00
5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK
```

And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which
point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4943/358)
as efficient under high load as when the CPU is unloaded. Sweet!!

Another really cool thing to do here is to derive the effective clock speed of the Atom CPU. We know it runs at
2200MHz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz,
remarkable precision. Color me impressed :-)

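These numbers also explain the chassis capacity quoted in the executive summary; a quick
back-of-the-envelope check, assuming the three worker threads scale linearly (which the loadtests
below largely confirm):

```
cycles_per_packet = 438     # including dpdk-input, from the 10Mpps run above
clock_hz          = 2.2e9   # Atom C3558 at 2200MHz
workers           = 3       # CPU cores dedicated to VPP worker threads

per_core = clock_hz / cycles_per_packet
print(per_core)             # ~5.02 Mpps per worker, matching the measured 5.01 Mpps
print(per_core * workers)   # ~15.1 Mpps total forwarding capacity
```
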
#### Method 2: Rampup using trex-loadtest.py

For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix,
64b multi-flow and 64b 1-flow), looking at both unidirectional and bidirectional traffic, and performing each
of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40
loadtests!

Loadtest  | pfSense  | Ubuntu   | VPP 1Q  | VPP 2Q  | VPP 3Q  | Details
--------- | -------- | -------- | ------- | ------- | ------- | ----------
***Unidirectional*** |
1514b     | 97%      | 97%      | 97%     | 97%     | ***97%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
imix      | 61%      | 75%      | 96%     | 95%     | ***95%***  | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
64b       | 15%      | 17%      | 33%     | 66%     | ***96%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
64b 1-flow| 4.4%     | 4.7%     | 33%     | 33%     | ***33%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
***Bidirectional*** |
1514b     | 192%     | 193%     | 193%    | 193%    | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
imix      | 63%      | 71%      | 190%    | 190%    | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
64b       | 15%      | 16%      | 61%     | 63%     | ***81%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
64b 1-flow| 8.6%     | 9.0%     | 61%     | ***61%*** | 33% (+)  | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]

A picture says a thousand words - so I invite you to take a look at the interactive graphs linked in the table
above. I'll cherry-pick the one I find most interesting here:

{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}

The graph above is of the unidirectional _64b_ loadtest. Some observations:
* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace) and Ubuntu 20.04.3 (running Linux 5.13, the
orange trace just above it) are roughly equal performers. They handle full-sized (1514 byte) packets just fine,
struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1
CPU core can be used.
* Even at 64b packets, VPP scales linearly: from 33% of line rate with 1Q (the green trace), to 66% with 2Q (the
red trace), and 96% with 3Q (the purple trace, which makes it through to the end).
* With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2
(see the startup.conf sketch below).

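For reference, varying the worker and queue count for the VPP runs boils down to a few lines in
`/etc/vpp/startup.conf`. A minimal sketch -- the PCI addresses and core numbers here are illustrative
assumptions, not the exact configuration of this unit:

```
cpu {
  main-core 0
  corelist-workers 1-3      # three worker threads, one per remaining Atom core
}
dpdk {
  dev 0000:03:00.0 { num-rx-queues 3  num-tx-queues 3 }
  dev 0000:03:00.1 { num-rx-queues 3  num-tx-queues 3 }
}
```
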
## Caveats

The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of
load- and systems integration testing, and to compare their internal benchmarking with my findings.
Other than that, this is not a paid endorsement, and the views in this review are my own.

One quirk I noticed is that when running VPP with 3Q and bidirectional traffic, performance is much worse
than with 2Q or 1Q. This is not a fluke of the loadtest, as I have observed the same strange behavior
on other machines (a Supermicro 5018D-FN8T, for example). I confirmed that each VPP worker thread is used
for each queue, so I would've expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%),
but I get 16.8% instead [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to
understand that better, but for now I'm releasing the data as-is.

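For anyone wanting to double-check the queue/worker placement on their own gear, VPP can show it
directly (commands only, output omitted here):

```
vppctl show threads                   # main + worker threads and their CPU cores
vppctl show interface rx-placement    # which worker polls which Rx queue
vppctl show run                       # per-node, per-thread vectors and clocks (as used above)
```
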
## Appendix

### Generating the data

You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz).
The archive contains the `trex-loadtest.py` script as well, for curious readers!
These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz)
to plot interactive graphs (which I've done for the table above):

```
DEVICE=netgate-6100
ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \
    netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html
done
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \
    netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html
done
```

### Notes on pfSense

I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I
simply choose the 'Give me a Shell' option, and take it from there. The router runs `pf` out of
the box, and its ruleset is pretty complex, so I'll just configure some addresses and routes, and disable the
firewall altogether. That seems only fair, as the same tests on Linux and VPP also do not use
a firewall (even though, obviously, both VPP and Linux support firewalls just fine).

```
ifconfig ix0 inet 100.65.1.1/24
ifconfig ix1 inet 100.65.2.1/24
route add -net 16.0.0.0/8 100.65.1.2
route add -net 48.0.0.0/8 100.65.2.2
pfctl -d
```

### Notes on Linux

When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will
thrash around re-routing softirqs between CPU threads; and at the end of the day, I'm trying to saturate
all CPUs anyway, so balancing/moving them around doesn't make any sense. Further, Linux needs static
ARP entries pointing at the TRex ports:

```
sudo systemctl disable irqbalance
sudo systemctl stop irqbalance
sudo systemctl mask irqbalance

sudo ip addr add 100.65.1.1/24 dev enp3s0f0
sudo ip addr add 100.65.2.1/24 dev enp3s0f1
sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0
sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1
sudo ip ro add 16.0.0.0/8 via 100.65.1.2
sudo ip ro add 48.0.0.0/8 via 100.65.2.2
```

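Not shown above: the kernel also needs IPv4 forwarding enabled, assuming the installer didn't already
turn it on:

```
sudo sysctl -w net.ipv4.ip_forward=1
```
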
On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest:
```
root@netgate:/home/pim# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          3          0          0          1
       TIMER:     203788     247280     259544     401401
      NET_TX:       8956       8373       7836       6154
      NET_RX:   22003822   19316480   22526729   19430299
       BLOCK:       2545       3153       2430       1463
    IRQ_POLL:          0          0          0          0
     TASKLET:       5084         60       1830         23
       SCHED:     137647     117482      56371      49112
     HRTIMER:          0          0          0          0
         RCU:      11550       9023       8975       8075
```