---
date: "2021-11-26T08:51:23Z"
title: 'Review: Netgate 6100'
aliases:
  - /s/articles/2021/11/26/netgate-6100.html
---

* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
* Status: Draft - Review - **Approved**

A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({{< ref "2021-07-19-pcengines-apu6" >}}) and introduced me to their new desktop router/firewall, the Netgate 6100. It currently ships with [pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's also designed to run their [TNSR](https://www.netgate.com/tnsr) software: the device ships with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces are Intel, DPDK-capable chips. He asked me if I was willing to take it around the block with VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet public for this device, but that's not a problem, because [AS8298 runs VPP]({{< ref "2021-09-10-vpp-6" >}}), so I'll just go ahead and install it myself ...

# Executive Summary

The Netgate 6100 router running pfSense has a single-core performance of 623kpps and a total chassis throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps), about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields very similar results.

With VPP though, the router's single-core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction on 10Gbit. Due to its Atom C3558 processor (which has 4 cores, 3 of which are dedicated to VPP's worker threads, and 1 to its main thread and the controlplane running in Linux), achieving 10Gbit line rate in both directions with 64 byte packets is not possible.

Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26µJ_ of energy per forwarded packet**, while at the same time easily handling a full BGP table with room to spare. I find this Netgate 6100 appliance pretty impressive, and when TNSR becomes available, performance will be similar to what I've tested here, at a price tag of USD 699,-

## Detailed findings

{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}

The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an Intel Atom C3558 CPU (4 cores, including AES-NI and QuickAssist), 8GB of memory and either 16GB of eMMC or 128GB of NVMe storage. The network cards are its main forté: it comes with 2x i354 gigabit combo ports (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+ cage each, for a total of 8 ports and lots of connectivity.

The machine is fanless, which is made possible by its power-efficient CPU: the Atom here runs at only 16W TDP, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick, but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are a few nice touches that I noticed:

* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
* There's both a Cisco RJ45 console port (115200,8n1) and an onboard CP2102 USB/serial converter, which means you can also connect to its serial port with a standard issue micro-USB cable. Cool!

### Battle of Operating Systems

Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So I'll have to deface this little guy, and reinstall it with Linux. My game plan is:

1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
1. Install VPP using my own [HOWTO]({{< ref "2021-09-21-vpp-7" >}}), and do all the loadtests

This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy as I had anticipated (but is it ever easy, really?)

{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}}

Turning on the device, I'm presented with BIOS firmware from Insyde Software, which loads software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which in turn loads modules called _Lenses_, pictured right. Anyway, this ultimately presents me with a ***Press F2 for Boot Options*** prompt. Aha! That's exactly what I'm looking for. I'm really grateful that Netgate decided to ship a device with a BIOS that allows me to boot off of other media, notably a USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html), but in my case also to install another operating system entirely.

My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the side). But none of the USB ports would load my UEFI-prepared `bootx64.efi` USB key. So my second attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]:

```
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso
mv mini.iso /tmp/mini-focal.iso
grub-mkimage --format=x86_64-efi \
  --output=/var/tftpboot/grubnetx64.efi.signed \
  --memdisk=/tmp/mini-focal.iso \
  `ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'`
```

After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time to the stone age, I see the PXE client request an IPv4 address and then fetch the image I prepared. And, it boots, yippie!
```
Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed'
```

I took a peek while `grubnetx64` was booting, and saw that the available output terminals on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that the default/active one is `console`. So I make a note that GRUB wants to run on 'console' (and specifically NOT on 'serial', as is usual -- see below for a few more details on this), while the Linux kernel will of course be running on serial, so I have to add `console=ttyS0,115200n8` to the kernel boot string before booting. Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get it quite right -- pro-tip: install OpenSSH and fix the GRUB and kernel configs before finishing the `mini.iso` install:

```
mount --bind /proc /target/proc
mount --bind /dev /target/dev
mount --bind /sys /target/sys
chroot /target /bin/bash

# Install OpenSSH, otherwise the machine boots w/o access :)
apt update
apt install openssh-server

# Fix serial for GRUB and Kernel
vi /etc/default/grub
## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
## set GRUB_TERMINAL=console (and comment out the serial stuff)
grub-install /dev/sda
update-grub
```

Rebooting now brings me to Ubuntu. Pat on the back, Caveman Pim, you've still got it!

## Network Loadtest

After that small but exciting detour, let me get back to the loadtesting. The choice of Intel network controllers on this board allows me to use Intel's DPDK, with considerably higher performance than regular kernel-based routing. I loadtested the stock pfSense firmware (21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]). Specifically worth calling out: while Linux and FreeBSD struggled in the packets-per-second department, the use of DPDK in VPP meant absolutely no problem filling a unidirectional 10G stream of "regular internet traffic" (referred to as `imix`). VPP was also able to reach _line rate_ with 64b UDP packets, with just a little headroom to spare, but it ultimately struggled with _bidirectional_ 64b UDP packets.

### Methodology

For the loadtests, I used Cisco's T-Rex [[ref](https://trex-tgn.cisco.com/)] in stateless mode, with a custom Python controller that ramps up and down traffic from the loadtester to the _device under test_ (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back in on `port0`). The loadtester first sends a few seconds of warmup traffic; this ensures the DUT is passing traffic and offers the ability to inspect the traffic before the actual rampup. Then the loadtester ramps up linearly from zero to 100% of line rate (in this case, line rate is 10Gbps in both directions), and finally it holds the traffic at full line rate for a certain duration. If at any time the loadtester fails to see the traffic it's emitting return on its second port, it flags the DUT as saturated, and this is noted as the maximum bits/second and/or packets/second.
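The actual controller script is included in the appendix archive, but to make the methodology concrete, here is a minimal sketch of what such a ramp-up controller looks like against the TRex stateless Python API. The step count, durations and loss threshold below are illustrative assumptions, not the values my script uses:

```python
#!/usr/bin/env python3
# A minimal sketch of the ramp-up logic described above. This is NOT the
# actual trex-loadtest.py (that one is linked in the appendix); the step
# count, durations and loss threshold are illustrative assumptions only.
# Depending on the TRex version, the import path is trex.stl.api or
# trex_stl_lib.api.
from trex.stl.api import *

STEPS, STEP_SECONDS, HOLD_SECONDS = 20, 15, 60
LOSS_THRESHOLD = 0.001   # assumed: >0.1% loss means the DUT is saturated

c = STLClient(server="127.0.0.1")
c.connect()
c.reset(ports=[0, 1])

# One continuous UDP stream per direction; the rate is scaled with 'mult' later.
pkt = STLPktBuilder(pkt=Ether()/IP(src="16.0.0.1", dst="48.0.0.1")/
                        UDP(dport=12, sport=1025)/(10*'x'))
c.add_streams(STLStream(packet=pkt, mode=STLTXCont()), ports=[0, 1])

# Warmup: a few seconds of low-rate traffic to verify the DUT forwards at all.
c.clear_stats()
c.start(ports=[0, 1], mult="1%", duration=5)
c.wait_on_traffic(ports=[0, 1])

# Linear ramp from 0..100% of line rate, then hold at the top.
for step in range(1, STEPS + 1):
    mult = "%d%%" % (100 * step // STEPS)
    c.clear_stats()
    c.start(ports=[0, 1], mult=mult, duration=STEP_SECONDS)
    c.wait_on_traffic(ports=[0, 1])
    stats = c.get_stats()
    tx = stats[0]["opackets"] + stats[1]["opackets"]
    rx = stats[0]["ipackets"] + stats[1]["ipackets"]
    loss = 1.0 - float(rx) / tx if tx else 0.0
    print("%s of line rate: tx=%d rx=%d loss=%.4f" % (mult, tx, rx, loss))
    if loss > LOSS_THRESHOLD:
        print("DUT saturated at %s of line rate" % mult)
        break
else:
    c.start(ports=[0, 1], mult="100%", duration=HOLD_SECONDS)
    c.wait_on_traffic(ports=[0, 1])

c.disconnect()
```

The real `trex-loadtest.py` additionally writes its statistics out as JSON, which is what the interactive graphs later in this article are generated from.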
Since my last loadtesting [post]({{< ref "2021-07-19-pcengines-apu6" >}}), I've learned a lot more about packet forwarding and how to make it easier or harder on the router. Let me go into a few more details about the various loadtests that I've done here.

#### Method 1: Single CPU Thread Saturation

Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side Scaling_ (RSS), which means the NIC can offload its packets into multiple queues. The kernel will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are initialized, and inbound traffic is spread, typically using some hashing function, over the individual CPUs, allowing for a higher aggregate throughput. Usually this hashing function is based on some L3 or L4 payload, for example a hash over the source IP/port and destination IP/port.

So one interesting test is to send **the same packet** over and over again -- the hash function will then return the same value for each packet, which means all traffic goes into exactly one of the N available queues, and is therefore handled by only one core. One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies, simply sends the same UDP packet from the same source IP/port to the same destination IP/port, padded with a bunch of 'x' characters, over and over again:

```
packet = STLPktBuilder(pkt = Ether()/
    IP(src="16.0.0.1",dst="48.0.0.1")/
    UDP(dport=12,sport=1025)/(10*'x')
)
```

#### Method 2: Rampup using trex-loadtest.py

TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional arguments, does the same thing as the above method. However, this profile optionally takes a few arguments, called _tunables_, notably:

* ***size*** - sets the size of the packets to either a number (i.e. 64, the default, or any number up to a maximum of 9216 bytes), or the string `imix`, which will send a traffic mix consisting of 60b, 590b and 1514b packets.
* ***vm*** - sets the packet source/dest generation. By default (when the flag is `None`), the same src (16.0.0.1) and dst (48.0.0.1) is set for each packet. When setting the value to `var1`, the source IP is incremented through `16.0.0.[4-254]`. If the value is set to `var2`, the source _and_ destination IP are incremented, the destination through `48.0.0.[4-254]`.

So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed with this, I will perform a loadtest with four modes of interest, from easier to more challenging:

1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform, as the traffic consists of large (1514 byte) packets, and both source and destination are different each time, which means lots of multiplexing across receive queues, and relatively few packets/sec.
1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This yields what can reasonably be expected from _normal internet use_, just about 3.2Mpps at 10Gbps. This is the most representative test for normal use, but the packet rate is still quite low due to the (relatively) large packets. Any respectable router should be able to perform well at an imix profile.
1. ***bench-var2-64b***: still multiple flows, but very small packets, 14.88Mpps at 10Gbps, often referred to as the theoretical maximum throughput on TenGig.
Now it's getting harder, as the loadtester will fill the line with small packets (of 64 bytes, the smallest that an ethernet packet is allowed to be). This is a good way to see if the router vendor is actually capable of what is referred to as _line rate_ forwarding.
1. ***bench***: now restricted to a constant src/dst IP:port tuple at the same rate of 14.88Mpps at 10Gbps, which means only one Rx queue (and thus one CPU core) can be used. This is where single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle with a test like this, if any given CPU core cannot individually handle full line rate. I'm looking at you, Tilera!

Further to this list, I can send traffic in one direction only (TRex will emit this from its port0 and expect the traffic to be seen back at port1); or I can send it ***in both directions***. The latter doubles the packet rate and bandwidth, to approx 29.7Mpps.

***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own transmit queues, so I decide to drive it a bit less close to the cliff and stop at 97% of line rate (this is 28.3Mpps). This explains why lots of these loadtests top out at that number.

### Results

#### Method 1: Single CPU Thread Saturation

Given the approaches above, for the first method I can "just" saturate the line and see how many packets emerge through the DUT on the other port, so that's only 3 tests:

Netgate 6100  | Loadtest        | Throughput (pps) | L1 Throughput (bps) | % of linerate
------------- | --------------- | ---------------- | ------------------- | -------------
pfSense       | 64b 1-flow      | 622.98 Kpps      | 418.65 Mbps         | 4.19%
Linux         | 64b 1-flow      | 642.71 Kpps      | 431.90 Mbps         | 4.32%
***VPP***     | ***64b 1-flow***| ***5.01 Mpps***  | ***3.37 Gbps***     | ***33.67%***

***NOTE***: The bandwidth figures here are so-called _L1 throughput_, which means bits on the wire, as opposed to _L2 throughput_, which means bits in the ethernet frame. This is particularly relevant for 64b loadtests, as the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-of-frame delimiter, and 12 bytes inter-frame gap [[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also means that when L1 bandwidth is fully consumed at 10Gbps, the observed L2 bandwidth will be only 7.62Gbps.

#### Interlude - VPP efficiency

In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons it's so quick is that VPP consumes the entire core, and grabs ***a set of packets*** from the NIC rather than doing work for each individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first of these packets causes the code for the current graph node to be fetched into the CPU's instruction cache, and the second and further packets make use of the warmed-up cache, greatly improving per-packet efficiency.

I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface. I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate.
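As a back-of-the-envelope model (my own framing, not a formula from the VPP documentation): if dispatching a graph node costs a roughly fixed number of cycles $C_{fixed}$ (instruction cache fill, function call overhead) plus $C_{pkt}$ cycles of real per-packet work, then a vector of $V$ packets costs roughly

$$ \text{cycles/packet} \approx \frac{C_{fixed}}{V} + C_{pkt} $$

so the per-packet cost should fall as the vectors fill up under load.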
Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1 with CPU cores), can be done with the `vppctl show runtime` command and some `awk` magic:

```
$ vppctl clear run && sleep 30 && vppctl show run | \
  awk '$2 ~ /active|polling/ && $4 > 25000 { print $0; if ($1=="ethernet-input") { packets = $4}; if ($1=="dpdk-input") { dpdk_time = $6}; total_time += $6 } END { print packets/30, "packets/sec, at",total_time,"cycles/packet,", total_time-dpdk_time,"cycles/packet not counting DPDK" }'
```

This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen, omitting the columns that are not super relevant. Ready? Here we go!

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps
Graph Node Name                   Clocks    Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output    6.07e2    1.00
TenGigabitEthernet3/0/1-tx        8.61e2    1.00
dpdk-input                        1.51e6    0.00
ethernet-input                    1.22e3    1.00
ip4-input-no-checksum             6.59e2    1.00
ip4-load-balance                  4.50e2    1.00
ip4-lookup                        5.63e2    1.00
ip4-rewrite                       5.83e2    1.00
1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK
```

I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average number of vectors per call is pretty much zero, which means most of the CPU time goes to sitting in a nice "do nothing" loop. Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report both the total cycles spent per packet (that is to say, including DPDK), as well as the cycles spent per packet in the _other active_ nodes. In this case, at 1kpps, VPP is spending 4943 cycles on each packet.

Now, take a look at what happens when I raise the traffic to 1Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps
Graph Node Name                   Clocks    Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output    3.80e1    18.57
TenGigabitEthernet3/0/1-tx        1.44e2    18.57
dpdk-input                        1.15e3    .39
ethernet-input                    1.39e2    18.57
ip4-input-no-checksum             8.26e1    18.57
ip4-load-balance                  5.85e1    18.57
ip4-lookup                        7.94e1    18.57
ip4-rewrite                       7.86e1    18.57
981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK
```

Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that the CPU efficiency went up greatly, from 4943 cycles/packet at 1kpps to 620 cycles/packet at 1Mpps. That's an order of magnitude improvement!

Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps
Graph Node Name                   Clocks    Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output    1.41e1    256.00
TenGigabitEthernet3/0/1-tx        1.23e2    256.00
dpdk-input                        7.95e1    256.00
ethernet-input                    6.74e1    256.00
ip4-input-no-checksum             3.95e1    256.00
ip4-load-balance                  2.54e1    256.00
ip4-lookup                        4.12e1    256.00
ip4-rewrite                       4.78e1    256.00
5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK
```

And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4943/358) as efficient under high load as when the CPU is unloaded. Sweet!!
Another really cool thing to do here is derive the effective clock speed of the Atom CPU. We know it runs at 2200MHz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK: 5.01M x 438 adds up to roughly 2194 million cycles per second, or 2194MHz -- remarkable precision. Color me impressed :-)

#### Method 2: Rampup using trex-loadtest.py

For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix, 64b-multi and 64b 1-flow), both with unidirectional and with bidirectional traffic, and I perform each of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40 loadtests!

Loadtest  | pfSense  | Ubuntu   | VPP 1Q  | VPP 2Q  | VPP 3Q  | Details
--------- | -------- | -------- | ------- | ------- | ------- | ----------
***Unidirectional*** | | | | | |
1514b     | 97%      | 97%      | 97%     | 97%     | ***97%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
imix      | 61%      | 75%      | 96%     | 95%     | ***95%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
64b       | 15%      | 17%      | 33%     | 66%     | ***96%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
64b 1-flow| 4.4%     | 4.7%     | 33%     | 33%     | ***33%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
***Bidirectional*** | | | | | |
1514b     | 192%     | 193%     | 193%    | 193%    | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
imix      | 63%      | 71%      | 190%    | 190%    | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
64b       | 15%      | 16%      | 61%     | 63%     | ***81%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
64b 1-flow| 8.6%     | 9.0%     | 61%     | ***61%*** | 33% (+) | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]

A picture says a thousand words - so I invite you to take a look at the interactive graphs linked from the table above. I'll cherry-pick the one I find most interesting here:

{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}

The graph above is of the unidirectional _64b_ loadtest. Some observations:

* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace) and Ubuntu 20.04.3 (running Linux 5.13, the orange trace just above it) are about equal performers. They handle full-sized (1514 byte) packets just fine, struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1 CPU core can be used.
* Even at 64b packets, VPP scales linearly from 33% of line rate with 1Q (the green trace), to 66% with 2Q (the red trace) and 96% with 3Q (the purple trace, which makes it all the way to the end).
* With VPP taking 3Q, one CPU core is left over for the main thread and controlplane software like FRR or Bird2.

## Caveats

The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of load- and systems integration testing and comparing their internal benchmarking with my findings. Other than that, this is not a paid endorsement and the views in this review are my own.

One quirk I noticed is that when running VPP with 3Q and bidirectional traffic, performance is much worse than with 2Q or 1Q. This is not a fluke of the loadtest, as I have observed the same strange behavior with other machines (Supermicro 5018D-FN8T for example).
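As an aside for readers who want to poke at this themselves: VPP can show which worker thread polls which receive queue, which is one way to verify the queue-to-worker mapping discussed below (the exact output format varies between VPP versions):

```
# List VPP's main and worker threads, and the CPU core each one runs on
vppctl show threads

# Show which worker thread polls which interface receive queue
vppctl show interface rx-placement
```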
I confirmed that each VPP worker thread is servicing its own queue, so I would've expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%), but I get 16.8% instead [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to understand that better, but for now I'm releasing the data as-is.

## Appendix

### Generating the data

You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz). The archive contains the `trex-loadtest.py` script as well, for curious readers! These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz) to plot interactive graphs (which I've done for the table above):

```
DEVICE=netgate-6100
ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json

for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \
    netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html
done

for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \
    netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html
done
```

### Notes on pfSense

I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I simply choose the 'Give me a Shell' option, and take it from there. The router runs `pf` out of the box, and it is pretty complex, so I'll just configure some addresses and routes, and disable the firewall altogether. That seems only fair, as the same tests with Linux and VPP also do not use a firewall (even though, obviously, both VPP and Linux support firewalls just fine).

```
ifconfig ix0 inet 100.65.1.1/24
ifconfig ix1 inet 100.65.2.1/24
route add -net 16.0.0.0/8 100.65.1.2
route add -net 48.0.0.0/8 100.65.2.2
pfctl -d
```

### Notes on Linux

When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off; otherwise the kernel will thrash around re-routing softirqs between CPU threads, and since I'm trying to saturate all CPUs anyway, balancing/moving them around doesn't make any sense. Further, Linux needs a static ARP entry for each of the TRex-facing interfaces:

```
sudo systemctl disable irqbalance
sudo systemctl stop irqbalance
sudo systemctl mask irqbalance

sudo ip addr add 100.65.1.1/24 dev enp3s0f0
sudo ip addr add 100.65.2.1/24 dev enp3s0f1
sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0
sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1
sudo ip ro add 16.0.0.0/8 via 100.65.1.2
sudo ip ro add 48.0.0.0/8 via 100.65.2.2
```

On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest:

```
root@netgate:/home/pim# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          3          0          0          1
       TIMER:     203788     247280     259544     401401
      NET_TX:       8956       8373       7836       6154
      NET_RX:   22003822   19316480   22526729   19430299
       BLOCK:       2545       3153       2430       1463
    IRQ_POLL:          0          0          0          0
     TASKLET:       5084         60       1830         23
       SCHED:     137647     117482      56371      49112
     HRTIMER:          0          0          0          0
         RCU:      11550       9023       8975       8075
```
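For completeness: if the softirq spread had looked lopsided, the number of RSS queues and the IRQ affinity can be inspected and pinned by hand. This is just a sketch, assuming the same interface names as above; the IRQ number is an example and will differ per machine:

```
# Show (and, if desired, change) the number of combined Rx/Tx queues on a port
ethtool -l enp3s0f0
sudo ethtool -L enp3s0f0 combined 4

# Find the interface's IRQs and pin one of them to a specific CPU core
grep enp3s0f0 /proc/interrupts
echo 2 | sudo tee /proc/irq/123/smp_affinity_list   ## 123 is an example IRQ number
```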