---
date: "2021-11-26T08:51:23Z"
title: 'Review: Netgate 6100'
aliases:
- /s/articles/2021/11/26/netgate-6100.html
---

* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
* Status: Draft - Review - **Approved**

A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({{< ref "2021-07-19-pcengines-apu6" >}})
and introduced me to their new desktop router/firewall, the Netgate 6100. It currently ships with
[pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's also designed
to run their [TNSR](https://www.netgate.com/tnsr) software, considering the device ships
with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces
are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with
VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet
public for this device, but that's not a problem because [AS8298 runs VPP]({{< ref "2021-09-10-vpp-6" >}}),
so I'll just go ahead and install it myself ...

# Executive Summary

The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis
throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps),
about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields
very similar results.

With VPP though, the router's single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This
means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction
on 10Gbit. Due to its Atom C3558 processor (which has 4 cores, 3 of which are dedicated to VPP's worker
threads, and 1 to its main thread and controlplane running in Linux), achieving 10Gbit line rate in
both directions when using 64 byte packets is not possible.

Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26µJ_ of
energy per forwarded packet**, while at the same time easily handling a full BGP table with room to
spare. I find this Netgate 6100 appliance pretty impressive and when TNSR becomes available,
performance will be similar to what I've tested here, at a price tag of USD 699,-

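The energy figure is plain arithmetic: power draw divided by packet rate. Here is a minimal sketch in Python that uses only the 19W and 15.1Mpps numbers quoted above; the function name is mine and purely illustrative.

```python
def energy_per_packet_joules(watts: float, packets_per_second: float) -> float:
    """Energy per forwarded packet: power (joules/second) divided by rate (packets/second)."""
    return watts / packets_per_second


if __name__ == "__main__":
    joules = energy_per_packet_joules(19.0, 15.1e6)
    print(f"{joules * 1e6:.2f} microjoules per packet")
```

Note that this assumes the 19W draw holds while the box is forwarding at full capacity.
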
## Detailed findings

{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}

The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an
Intel Atom C3558 CPU (4 cores including AES-NI and QuickAssist), 8GB of memory and either 16GB of
eMMC, or 128GB of NVMe storage. The network cards are its main forte: it comes with 2x i354
gigabit combo (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+
cage each, for a total of 8 ports and lots of connectivity.

The machine is fanless and this is made possible by its power efficient CPU: the Atom here runs at 16W
TDP only, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick,
but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are
a few nice touches that I noticed:
* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
* There's both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector,
  which means you can connect to its serial port as well with a standard issue micro-USB cable. Cool!

### Battle of Operating Systems

Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering
firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial
to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So
I'll have to deface this little guy, and reinstall it with Linux. My game plan is:

1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
1. Install VPP using my own [HOWTO]({{< ref "2021-09-21-vpp-7" >}}), and do all the loadtests

This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a
description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy
as I had anticipated (but is it ever easy, really?)

{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}}

When I turn on the device, it presents me with some BIOS firmware from Insyde Software which
is loading some software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which
in turn is loading modules called _Lenses_, pictured right. Anyway, this ultimately presents
me with a ***Press F2 for Boot Options***. Aha! That's exactly what I'm looking for. I'm really
grateful that Netgate decided to ship a device with a BIOS that will allow me to boot off of other
media, notably the USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html)
but in my case, also to install another operating system entirely.

My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the
side). But none of the USB ports want to load my UEFI `bootx64.efi` prepared USB key. So my second
attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]:

```
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso
mv mini.iso /tmp/mini-focal.iso
grub-mkimage --format=x86_64-efi \
  --output=/var/tftpboot/grubnetx64.efi.signed \
  --memdisk=/tmp/mini-focal.iso \
  `ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'`
```

After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time
to the stone age, I see the PXE both request an IPv4 address, and the image I prepared. And, it boots, yippie!

```
Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed'
```

I took a peek while the `grubnetx64` was booting, and saw that the available output terminals
on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that
the default/active one is `console`. So I make a note that Grub wants to run on 'console' (and
specifically NOT on 'serial', as is usual, see below for a few more details on this), while the Linux
kernel will of course be running on serial, which means I have to add `console=ttyS0,115200n8` to the
kernel boot string before booting.

Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get
it quite right -- pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the
`mini.iso` install:

```
mount --bind /proc /target/proc
mount --bind /dev /target/dev
mount --bind /sys /target/sys
chroot /target /bin/bash

# Install OpenSSH, otherwise the machine boots w/o access :)
apt update
apt install openssh-server

# Fix serial for GRUB and Kernel
vi /etc/default/grub
## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
## set GRUB_TERMINAL=console (and comment out the serial stuff)
grub-install /dev/sda
update-grub
```

Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you've still got it!

## Network Loadtest

After that small but exciting detour, let me get back to the loadtesting. The choice of Intel's
network controllers on this board allows me to use Intel's DPDK with relatively high
performance, compared to regular (kernel) based routing. I loadtested the stock firmware pfSense
(21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]).

Specifically worth calling out is that while Linux and FreeBSD struggled in the packets-per-second
department, the use of DPDK in VPP meant absolutely no problems filling a unidirectional 10G stream
of "regular internet traffic" (referred to as `imix`). It was also able to fill _line rate_ with
"64b UDP packets", with just a little headroom there, but it ultimately struggled with _bidirectional_
64b UDP packets.

### Methodology

For the loadtests, I used Cisco's TRex [[ref](https://trex-tgn.cisco.com/)] in stateless mode,
with a custom Python controller that ramps up and down traffic from the loadtester to the _device
under test_ (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be
presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). The loadtester first sends a few seconds of warmup, to ensure the DUT is
passing traffic and to offer the ability to inspect the traffic before the actual rampup. Then
the loadtester ramps up linearly from zero to 100% of line rate (in this case, line rate is
10Gbps in both directions), and finally it holds the traffic at full line rate for a certain
duration. If at any time the loadtester fails to see the traffic it's emitting return on its
second port, it flags the DUT as saturated, and this is noted as the maximum bits/second and/or
packets/second.

Since my last loadtesting [post]({{< ref "2021-07-19-pcengines-apu6" >}}), I've learned a lot
more about packet forwarding and how to make it easier or harder on the router. Let me go into a
few more details about the various loadtests that I've done here.

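To make the warmup / rampup / hold sequence a bit more concrete, here is a rough sketch of the schedule and the saturation check in plain Python. This is not the actual `trex-loadtest.py`; the function names, the one-second step, the 1% warmup rate and the loss tolerance are all assumptions of mine, and no TRex API calls are involved.

```python
from typing import Iterator, List, Optional, Tuple


def ramp_schedule(warmup_s: int, rampup_s: int, hold_s: int,
                  warmup_pct: float = 1.0) -> Iterator[Tuple[int, float]]:
    """Yield (second, offered load as % of line rate) for one loadtest run."""
    t = 0
    for _ in range(warmup_s):          # low constant rate: is the DUT passing traffic at all?
        yield t, warmup_pct
        t += 1
    for i in range(1, rampup_s + 1):   # linear ramp that ends at exactly 100%
        yield t, 100.0 * i / rampup_s
        t += 1
    for _ in range(hold_s):            # hold at full line rate
        yield t, 100.0
        t += 1


def first_saturation(samples: List[Tuple[float, float]],
                     tolerance: float = 0.001) -> Optional[float]:
    """Return the last offered rate that came back intact, or None if no loss was seen.

    samples holds (tx_pps, rx_pps) pairs in the order they were measured.
    """
    last_good = 0.0
    for tx, rx in samples:
        if rx < tx * (1.0 - tolerance):  # traffic went missing: the DUT is saturated
            return last_good
        last_good = tx
    return None


if __name__ == "__main__":
    for second, pct in ramp_schedule(warmup_s=2, rampup_s=4, hold_s=3):
        print(f"t={second}s offer {pct:.0f}% of line rate")
```

A real controller would push each target rate to the traffic generator and read back the port counters; the sketch only shows the bookkeeping around that.
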
#### Method 1: Single CPU Thread Saturation

Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues
if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side
Scaling_ (RSS), which means the NIC can offload its packets into multiple queues. The kernel
will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are
initialized, and inbound traffic is sent, typically using some hashing function, to individual
CPUs, allowing for a higher aggregate throughput.

Mostly, this hashing function is based on some L3 or L4 payload, for example a hash over
the source IP/port and destination IP/port. So one interesting test is to send **the same packet**
over and over again -- the hash function will then return the same value for each packet, which
means all traffic goes into exactly one of the N available queues, and is therefore handled by
only one core.

One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies,
simply sends the same UDP packet with the same source IP/port and destination IP/port, padded with
a bunch of 'x' characters, over and over again:

```
packet = STLPktBuilder(pkt =
    Ether()/
    IP(src="16.0.0.1",dst="48.0.0.1")/
    UDP(dport=12,sport=1025)/(10*'x')
)
```

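Why a single repeated packet pins one core can be shown with a toy model of the queue selection. Real NICs typically use a Toeplitz hash with a secret key for RSS; the CRC32 below is only a stand-in that I chose because it ships with Python, and `rx_queue` is a made-up name.

```python
import zlib
from collections import Counter

NUM_QUEUES = 4  # one receive queue per core on the 4-core Atom


def rx_queue(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Map a flow to a receive queue. CRC32 stands in for the NIC's real RSS hash."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % NUM_QUEUES


# One flow repeated a thousand times: every packet hashes to the same queue.
single = Counter(rx_queue("16.0.0.1", 1025, "48.0.0.1", 12) for _ in range(1000))

# Many flows (source IP varies): packets spread out over several queues.
multi = Counter(rx_queue(f"16.0.0.{i}", 1025, "48.0.0.1", 12) for i in range(4, 255))

print("queues used by one flow:", sorted(single))
print("queues used by many flows:", sorted(multi))
```

The exact queue numbers are meaningless here; the point is that identical headers can only ever land in one queue, whichever hash the NIC uses.
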
#### Method 2: Rampup using trex-loadtest.py

TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional
arguments, does the same thing as the above method. However, this profile optionally takes a few
arguments, which are called _tunables_, notably:
* ***size*** - set the size of the packets to either a number (e.g. 64, the default, or any number
  up to a maximum of 9216 bytes), or the string `imix` which will send a traffic mix consisting of
  60b, 590b and 1514b packets.
* ***vm*** - set the packet source/dest generation. By default (when the flag is `None`), the
  same src (16.0.0.1) and dst (48.0.0.1) is set for each packet. When setting the value to
  `var1`, the source IP is incremented over `16.0.0.[4-254]`. If the value is set to
  `var2`, the source _and_ destination IP are incremented, the destination over `48.0.0.[4-254]`.

So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed
with this, I will perform a loadtest with four modes of interest, from easier to more challenging:
1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform,
   as the traffic consists of large (1514 byte) packets, and both source and destination are
   different each time, which means lots of multiplexing across receive queues, and relatively few
   packets/sec.
1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This
   yields what can be reasonably expected from _normal internet use_, just about 3.2Mpps at
   10Gbps. This is the most representative test for normal use, but still the packet rate is quite
   low due to (relatively) large packets. Any respectable router should be able to perform well at
   an imix profile.
1. ***bench-var2-64b***: Still multiple flows, but very small packets, 14.88Mpps at 10Gbps,
   often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as
   the loadtester will fill the line with small packets (of 64 bytes, the smallest that an ethernet
   packet is allowed to be). This is a good way to see if the router vendor is actually capable of
   what is referred to as _line rate_ forwarding.
1. ***bench***: Now restricted to a constant src/dst IP:port tuple, and the same rate of
   14.88Mpps at 10Gbps, which means only one Rx queue (and thus, one CPU core) can be used. This is where
   single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle
   with a test like this if any given CPU core cannot individually handle a full line rate.
   I'm looking at you, Tilera!

Further to this list, I can send traffic in one direction only (TRex will emit this from its
port0 and expect the traffic to be seen back at port1); or I can send it ***in both directions***.
The latter will double the packet rate and bandwidth, to approx 29.7Mpps.


***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own
transmit queues, so I decide to drive it a bit less close to the cliff and stop at 97% of line rate
(this is about 28.9Mpps in both directions combined). It explains why lots of these loadtests top
out at that number.

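The packet rates quoted for the 1514b and 64b modes follow from the link speed and the frame size, as long as the 20 bytes that every ethernet frame occupies on the wire beyond the frame itself (preamble, start-of-frame delimiter and inter-frame gap) are counted too. A small sketch, with a function name of my own choosing:

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame + 12 inter-frame gap


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum packets/second that fit on a link for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (1514, 64):
        uni = line_rate_pps(10e9, size)
        print(f"{size:>4}b: {uni / 1e6:.2f} Mpps unidirectional, "
              f"{2 * uni / 1e6:.2f} Mpps bidirectional")
```

I left `imix` out on purpose: its packet rate depends on the exact ratio of the three frame sizes in the profile, which I don't want to guess at here.
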
### Results

#### Method 1: Single CPU Thread Saturation

Given the approaches above, for the first method I can "just" saturate the line and see how many packets
emerge through the DUT on the other port, so that's only 3 tests:

Netgate 6100  | Loadtest        | Throughput (pps) | L1 Throughput (bps)  | % of linerate
------------- | --------------- | ---------------- | -------------------- | -------------
pfSense       | 64b 1-flow      | 622.98 Kpps      | 418.65 Mbps          | 4.19%
Linux         | 64b 1-flow      | 642.71 Kpps      | 431.90 Mbps          | 4.32%
***VPP***     | ***64b 1-flow***| ***5.01 Mpps***  | ***3.37 Gbps***      | ***33.67%***

***NOTE***: The bandwidth figures here are so called _L1 throughput_ which means bits on the wire, as opposed
to _L2 throughput_ which means bits in the ethernet frame. This is relevant particularly at 64b loadtests as
the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-frame, and 12 bytes inter-frame gap
[[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also
means that when L1 bandwidth is fully consumed at 10Gbps, that the observed L2 bandwidth will be only 7.62Gbps.

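The L1 versus L2 distinction is easy to check against the table. The sketch below converts a packet rate into both figures for 64 byte frames; the helper names are mine, and the only inputs are numbers already shown above.

```python
OVERHEAD_BYTES = 20  # 7 preamble + 1 start-frame + 12 inter-frame gap


def l1_bps(pps: float, frame_bytes: int = 64) -> float:
    """Bits on the wire: frame plus per-frame overhead."""
    return pps * (frame_bytes + OVERHEAD_BYTES) * 8


def l2_bps(pps: float, frame_bytes: int = 64) -> float:
    """Bits in the ethernet frames only."""
    return pps * frame_bytes * 8


if __name__ == "__main__":
    pfsense_pps = 622.98e3
    print(f"pfSense: {l1_bps(pfsense_pps) / 1e6:.2f} Mbps L1, "
          f"{l2_bps(pfsense_pps) / 1e6:.2f} Mbps L2")

    full_line_pps = 10e9 / ((64 + OVERHEAD_BYTES) * 8)  # 64b frames filling 10Gbps of L1
    print(f"10Gbps of L1 carries {l2_bps(full_line_pps) / 1e9:.2f} Gbps of L2")
```

Small differences in the last digit against the table come from the packet rate being rounded before it was printed.
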
#### Interlude - VPP efficiency

In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons why it's so quick is because
VPP will consume the entire core, and grab ***a set of packets*** from the NIC rather than do work for each
individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first
of these packets will result in the code for the current graph node being fetched into the CPU's instruction cache,
and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency.

I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and
observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface.
I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate.
Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1
with CPU cores), can be done with the `vppctl show runtime` command and some `awk` magic:

```
$ vppctl clear run && sleep 30 && vppctl show run | \
  awk '$2 ~ /active|polling/ && $4 > 25000 {
         print $0;
         if ($1=="ethernet-input") { packets = $4};
         if ($1=="dpdk-input") { dpdk_time = $6};
         total_time += $6
       } END {
         print packets/30, "packets/sec, at",total_time,"cycles/packet,",
               total_time-dpdk_time,"cycles/packet not counting DPDK"
       }'
```

This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen,
and omit the columns that are not super relevant. Ready? Here we go!

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps
Graph Node Name                     Clocks   Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output      6.07e2           1.00
TenGigabitEthernet3/0/1-tx          8.61e2           1.00
dpdk-input                          1.51e6           0.00
ethernet-input                      1.22e3           1.00
ip4-input-no-checksum               6.59e2           1.00
ip4-load-balance                    4.50e2           1.00
ip4-lookup                          5.63e2           1.00
ip4-rewrite                         5.83e2           1.00
1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK
```

I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling
the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average
vectors per call is pretty much zero, and considering that, most of the CPU time is going to sit in a nice "do
nothing". Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report on
both the total cycles spent, that is to say including DPDK, and as well the cycles spent per packet in the
_other active_ nodes. In this case, at 1kpps, VPP is spending 4943 cycles on each packet.

Now, take a look what happens when I raise the traffic to 1Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps
Graph Node Name                     Clocks   Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output      3.80e1          18.57
TenGigabitEthernet3/0/1-tx          1.44e2          18.57
dpdk-input                          1.15e3            .39
ethernet-input                      1.39e2          18.57
ip4-input-no-checksum               8.26e1          18.57
ip4-load-balance                    5.85e1          18.57
ip4-lookup                          7.94e1          18.57
ip4-rewrite                         7.86e1          18.57
981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK
```

Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that
the CPU efficiency went up greatly, from 4943 cycles/packet at 1kpps, to 620 cycles/packet at 1Mpps.
That's nearly an order of magnitude improvement!

Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps
Graph Node Name                     Clocks   Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output      1.41e1         256.00
TenGigabitEthernet3/0/1-tx          1.23e2         256.00
dpdk-input                          7.95e1         256.00
ethernet-input                      6.74e1         256.00
ip4-input-no-checksum               3.95e1         256.00
ip4-load-balance                    2.54e1         256.00
ip4-lookup                          4.12e1         256.00
ip4-rewrite                         4.78e1         256.00
5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK
```

And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which
point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4943/358)
as efficient under high load as when the CPU is unloaded. Sweet!!

Another really cool thing to do here is derive the effective clock speed of the Atom CPU. We know it runs at
2200MHz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz,
remarkable precision. Color me impressed :-)

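The same bookkeeping that the `awk` one-liner does can be replayed on the 10Mpps output in a few lines of Python. The per-node clocks and the packet rate are copied from that output; the dictionary and function names are mine.

```python
# Per-node clocks/packet, copied from the 10Mpps `show runtime` output.
CLOCKS_10MPPS = {
    "TenGigabitEthernet3/0/1-output": 14.1,
    "TenGigabitEthernet3/0/1-tx": 123.0,
    "dpdk-input": 79.5,
    "ethernet-input": 67.4,
    "ip4-input-no-checksum": 39.5,
    "ip4-load-balance": 25.4,
    "ip4-lookup": 41.2,
    "ip4-rewrite": 47.8,
}
PACKETS_PER_SECOND = 5.01426e6


def cycles_per_packet(clocks: dict, exclude: tuple = ()) -> float:
    """Sum the clocks of all graph nodes a packet passes through, minus any excluded ones."""
    return sum(value for node, value in clocks.items() if node not in exclude)


total = cycles_per_packet(CLOCKS_10MPPS)
without_dpdk = cycles_per_packet(CLOCKS_10MPPS, exclude=("dpdk-input",))
effective_hz = PACKETS_PER_SECOND * total  # cycles/packet times packets/second

print(f"{total:.1f} cycles/packet, {without_dpdk:.1f} without DPDK, "
      f"{effective_hz / 1e6:.0f} MHz effective clock")
```

With the unrounded inputs the result lands a couple of MHz away from the figure in the text, which was computed from the rounded 5.01Mpps and 438 cycles; either way it sits just under the nominal 2200MHz.
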
#### Method 2: Rampup using trex-loadtest.py

For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix,
64b-multi and 64b 1-flow), each with unidirectional and bidirectional traffic, and I perform each
of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40
loadtests!

Loadtest  | pfSense  | Ubuntu   | VPP 1Q  | VPP 2Q  | VPP 3Q  | Details
--------- | -------- | -------- | ------- | ------- | ------- | ----------
***Unidirectional*** |
1514b     | 97%      | 97%      | 97%     | 97%     | ***97%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
imix      | 61%      | 75%      | 96%     | 95%     | ***95%***  | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
64b       | 15%      | 17%      | 33%     | 66%     | ***96%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
64b 1-flow| 4.4%     | 4.7%     | 33%     | 33%     | ***33%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
***Bidirectional*** |
1514b     | 192%     | 193%     | 193%    | 193%    | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
imix      | 63%      | 71%      | 190%    | 190%    | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
64b       | 15%      | 16%      | 61%     | 63%     | ***81%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
64b 1-flow| 8.6%     | 9.0%     | 61%     | ***61%*** | 33% (+)  | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]

A picture says a thousand words - so I invite you to take a look at the interactive graphs from the table
above. I'll cherrypick what I find the most interesting one here:

{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}

The graph above is of the unidirectional _64b_ loadtest. Some observations:
* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace), and Ubuntu 20.04.3 (running Linux 5.13, the
  orange trace just above it) are equal performers. They handle fullsized (1514 byte) packets just fine,
  struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1
  CPU core can be used.
* Even at 64b packets, VPP scales linearly from 33% of line rate with 1Q (the green trace), 66% with 2Q (the
  red trace) and 96% with 3Q (the purple trace, that makes it through to the end).
* With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2.

## Caveats

The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of
load- and systems integration testing and comparing their internal benchmarking with my findings.
Other than that, this is not a paid endorsement and the views in this review are my own.

One quirk I noticed is that while running VPP with 3Q and bidirectional 64b 1-flow traffic (the entry marked
(+) in the table above), performance is much worse than with 2Q or 1Q. This is not a fluke with the loadtest,
as I have observed the same strange performance with other machines (Supermicro 5018D-FN8T for example).
I confirmed that a VPP worker thread is used for each queue, so I would've expected ~15Mpps shared by both
interfaces (so a per-direction linerate of ~50%), but I get 16.8% instead
[[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to
understand that better, but for now I'm releasing the data as-is.

## Appendix

### Generating the data

You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz).
The archive contains the `trex-loadtest.py` script as well, for curious readers!
These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz)
to plot interactive graphs (which I've done for the table above):

```
DEVICE=netgate-6100
ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \
    netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html
done
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \
    netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html
done
```

### Notes on pfSense

I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I
simply choose the 'Give me a Shell' option, and take it from there. The router will run `pf` out of
the box, and it is pretty complex, so I'll just configure some addresses, routes and disable the
firewall altogether. That seems only fair, as the same tests with Linux and VPP also do not use
a firewall (even though obviously, both VPP and Linux support firewalls just fine).

```
ifconfig ix0 inet 100.65.1.1/24
ifconfig ix1 inet 100.65.2.1/24
route add -net 16.0.0.0/8 100.65.1.2
route add -net 48.0.0.0/8 100.65.2.2
pfctl -d
```

### Notes on Linux

When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will
thrash around re-routing softirq's between CPU threads, and at the end of the day, I'm trying to saturate
all CPUs anyway, so balancing/moving them around doesn't make any sense. Further, Linux needs
a static ARP entry for each of the TRex interfaces:

```
sudo systemctl disable irqbalance
sudo systemctl stop irqbalance
sudo systemctl mask irqbalance

sudo ip addr add 100.65.1.1/24 dev enp3s0f0
sudo ip addr add 100.65.2.1/24 dev enp3s0f1
sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0
sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1
sudo ip ro add 16.0.0.0/8 via 100.65.1.2
sudo ip ro add 48.0.0.0/8 via 100.65.2.2
```

On Linux, I now see a reasonable spread of softirqs across CPUs while doing a unidirectional loadtest:
```
root@netgate:/home/pim# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          3          0          0          1
       TIMER:     203788     247280     259544     401401
      NET_TX:       8956       8373       7836       6154
      NET_RX:   22003822   19316480   22526729   19430299
       BLOCK:       2545       3153       2430       1463
    IRQ_POLL:          0          0          0          0
     TASKLET:       5084         60       1830         23
       SCHED:     137647     117482      56371      49112
     HRTIMER:          0          0          0          0
         RCU:      11550       9023       8975       8075
```

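To put a number on "reasonable spread", the `NET_RX` row can be turned into a per-CPU share. This is a small sketch with a helper name of my own; the sample text is a trimmed copy of the output above. Keep in mind that these counters accumulate since boot, so they also include whatever the machine received before the loadtest started.

```python
from typing import List


def softirq_share(softirqs_text: str, name: str = "NET_RX") -> List[float]:
    """Return each CPU's fraction of the given softirq row from /proc/softirqs text."""
    for line in softirqs_text.splitlines():
        fields = line.split()
        if fields and fields[0] == f"{name}:":
            counts = [int(value) for value in fields[1:]]
            total = sum(counts)
            if total == 0:  # nothing counted yet, avoid dividing by zero
                return [0.0] * len(counts)
            return [count / total for count in counts]
    raise KeyError(name)


SAMPLE = """\
                    CPU0       CPU1       CPU2       CPU3
      NET_TX:       8956       8373       7836       6154
      NET_RX:   22003822   19316480   22526729   19430299
"""

shares = softirq_share(SAMPLE)
print(", ".join(f"CPU{cpu}: {share:.1%}" for cpu, share in enumerate(shares)))
```

On a live system the same function can be fed the contents of `/proc/softirqs` read from disk.
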