---
date: "2021-11-26T08:51:23Z"
title: 'Review: Netgate 6100'
---

* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
* Status: Draft - Review - **Approved**

A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({% post_url 2021-07-19-pcengines-apu6 %})
and introduced me to their new desktop router/firewall, the Netgate 6100. It currently ships with
[pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's also designed
to run their [TNSR](https://www.netgate.com/tnsr) software, considering the device ships
with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces
are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with
VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet
public for this device, but that's not a problem because [AS8298 runs VPP]({% post_url 2021-09-10-vpp-6 %}),
so I'll just go ahead and install it myself ...

# Executive Summary

The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis
throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps),
about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields
very similar results.

With VPP though, the router's single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This
means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction
on 10Gbit. Due to its Atom C3558 processor (4 cores, of which 3 are dedicated to VPP's worker threads
and 1 to the main thread and controlplane running in Linux), achieving 10Gbit line rate in both
directions with 64 byte packets is not possible.

Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26µJ_ of
energy per forwarded packet** (19W / 15.1Mpps ≈ 1.26µJ), while at the same time easily handling a
full BGP table with room to spare. I find this Netgate 6100 appliance pretty impressive, and when
TNSR becomes available, performance will be similar to what I've tested here, at a price tag of USD 699.

## Detailed findings

{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}

The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an
Intel Atom C3558 CPU (4 cores, including AES-NI and QuickAssist), 8GB of memory and either 16GB of
eMMC, or 128GB of NVMe storage. The network cards are its main forte: it comes with 2x i354
gigabit combo ports (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+
cage each, for a total of 8 ports and lots of connectivity.

The machine is fanless, made possible by its power efficient CPU: the Atom here runs at only 16W
TDP, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick,
but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are
a few nice touches that I noticed:
* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
* There's both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector,
which means you can connect to its serial port with a standard issue micro-USB cable as well. Cool!

### Battle of Operating Systems

Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering
firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial
to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So
I'll have to deface this little guy, and reinstall it with Linux. My game plan is:

1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
1. Install VPP using my own [HOWTO]({% post_url 2021-09-21-vpp-7 %}), and do all the loadtests

This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a
description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy
as I had anticipated (but is it ever easy, really?)

{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}}

Turning on the device, it presents me with BIOS firmware from Insyde Software, which
loads some software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which
in turn loads modules called _Lenses_, pictured right. Anyway, this ultimately presents
me with a ***Press F2 for Boot Options***. Aha! That's exactly what I'm looking for. I'm really
grateful that Netgate decided to ship a device with a BIOS that allows me to boot off of other
media, notably a USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html),
but in my case, also to install another operating system entirely.

My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the
side), but none of the USB ports would load my UEFI-prepared `bootx64.efi` USB key. So my second
attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]:

```
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso
mv mini.iso /tmp/mini-focal.iso
grub-mkimage --format=x86_64-efi \
  --output=/var/tftpboot/grubnetx64.efi.signed \
  --memdisk=/tmp/mini-focal.iso \
  `ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'`
```

After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time
to the stone age, I see the PXE client request an IPv4 address as well as the image I prepared. And, it boots, yippie!

```
Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed'
```

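For completeness, the DHCP/TFTP side of this was along the following lines -- a minimal isc-dhcp-server
sketch, where the subnet, addresses and TFTP root are assumptions chosen to match the log above, not my
exact config:

```
# /etc/dhcp/dhcpd.conf (excerpt): hand UEFI PXE clients the grub image
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.200 192.168.1.220;
  option routers 192.168.1.254;
  next-server 192.168.1.254;          # the TFTP server
  filename "grubnetx64.efi.signed";   # the image built with grub-mkimage above
}

# ... and a TFTP daemon serving /var/tftpboot, for example:
# in.tftpd --listen --secure /var/tftpboot
```
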
I took a peek while `grubnetx64` was booting, and saw that the available output terminals
on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that
the default/active one is `console`. So I make a note that Grub wants to run on 'console' (and
specifically NOT on 'serial', as is usual -- see below for a few more details on this), while the Linux
kernel will of course be running on serial, so I have to add `console=ttyS0,115200n8` to the kernel boot
string before booting.

Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get
it quite right -- pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the
`mini.iso` install:

```
mount --bind /proc /target/proc
mount --bind /dev /target/dev
mount --bind /sys /target/sys
chroot /target /bin/bash

# Install OpenSSH, otherwise the machine boots w/o access :)
apt update
apt install openssh-server

# Fix serial for GRUB and Kernel
vi /etc/default/grub
## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
## set GRUB_TERMINAL=console (and comment out the serial stuff)
grub-install /dev/sda
update-grub
```

Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you've still got it!

## Network Loadtest

After that small but exciting detour, let me get back to the loadtesting. The choice of Intel
network controllers on this board allows me to use DPDK with relatively high
performance, compared to regular (kernel based) routing. I loadtested the stock pfSense firmware
(21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]).

Specifically worth calling out: while Linux and FreeBSD struggled in the packets-per-second
department, the use of DPDK in VPP meant absolutely no problem filling a unidirectional 10G stream
of "regular internet traffic" (referred to as `imix`). It was also able to fill _line rate_ with
64b UDP packets, with just a little headroom to spare, but it ultimately struggled with _bidirectional_
64b UDP packets.

### Methodology

For the loadtests, I used Cisco's T-Rex [[ref](https://trex-tgn.cisco.com/)] in stateless mode,
with a custom Python controller that ramps traffic up and down from the loadtester to the _device
under test_ (DUT), by sending traffic out of `port0` to the DUT and expecting that traffic to be
presented back at its `port1`, and vice versa (out of `port1` -> DUT -> back in on `port0`). The
loadtester first sends a few seconds of warmup traffic, to ensure the DUT is passing traffic at all
and to offer a chance to inspect the traffic before the actual rampup. Then the loadtester ramps up
linearly from zero to 100% of line rate (in this case, line rate is 10Gbps in both directions), and
finally it holds the traffic at full line rate for a certain duration. If at any point the loadtester
fails to see the traffic it's emitting return on its other port, it flags the DUT as saturated, and
this is noted as the maximum bits/second and/or packets/second.

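The `trex-loadtest.py` script itself is linked in the appendix; the core of such a ramp-up against
TRex's stateless API looks roughly like the sketch below. The profile name, step size and hold time
are illustrative assumptions, not the exact parameters of my script, which also records per-step
statistics to JSON for the graphs further down.

```
#!/usr/bin/env python3
# Minimal sketch of a linear ramp-up using TRex's stateless API.
from trex_stl_lib.api import STLClient, STLProfile

def rampup(profile="stl/bench.py", warmup="1%", steps=20, hold=30):
    c = STLClient()                       # TRex server on localhost by default
    c.connect()
    try:
        c.reset(ports=[0, 1])
        streams = STLProfile.load_py(profile).get_streams()
        c.add_streams(streams, ports=[0, 1])

        # Warmup: make sure the DUT forwards traffic at all before ramping up.
        c.start(ports=[0, 1], mult=warmup, duration=5)
        c.wait_on_traffic(ports=[0, 1])

        # Ramp linearly towards 100% of line rate, watching for loss.
        for pct in range(5, 101, 100 // steps):
            c.clear_stats()
            c.start(ports=[0, 1], mult="%d%%" % pct, duration=10)
            c.wait_on_traffic(ports=[0, 1])
            stats = c.get_stats()
            tx = stats[0]["opackets"] + stats[1]["opackets"]
            rx = stats[0]["ipackets"] + stats[1]["ipackets"]
            if rx < 0.999 * tx:           # DUT drops packets: call it saturated
                print("DUT saturated at %d%% of line rate" % pct)
                break
        else:
            # Never saturated: hold at full line rate for a while.
            c.start(ports=[0, 1], mult="100%", duration=hold)
            c.wait_on_traffic(ports=[0, 1])
    finally:
        c.disconnect()

if __name__ == "__main__":
    rampup()
```
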
Since my last loadtesting [post]({% post_url 2021-07-19-pcengines-apu6 %}), I've learned a lot
more about packet forwarding and how to make it easier or harder on the router. Let me go into a
few more details about the various loadtests that I've done here.

#### Method 1: Single CPU Thread Saturation

Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues
if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side
Scaling_ (RSS), which means the NIC can distribute its packets over multiple queues. The kernel
will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are
initialized, and inbound traffic is spread, typically using some hashing function, over individual
CPUs, allowing for a higher aggregate throughput.

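On Linux, the queue count and RSS setup of these NICs can be inspected and changed with `ethtool`;
a quick sketch (interface names as used later in this post, and not every driver exposes all of these):

```
ethtool -l enp3s0f0                     # how many combined Rx/Tx queues the NIC offers/uses
ethtool -L enp3s0f0 combined 4          # use four queues, one per Atom core
ethtool -x enp3s0f0                     # show the RSS hash key and indirection table
ethtool -n enp3s0f0 rx-flow-hash udp4   # which header fields are hashed for UDP over IPv4
```
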
Mostly, this hashing function is computed over some L3 or L4 payload, for example a hash over
the source IP/port and destination IP/port. So one interesting test is to send **the same packet**
over and over again -- the hash function will then return the same value for each packet, which
means all traffic goes into exactly one of the N available queues, and is therefore handled by
only one core.

One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies,
simply sends the same UDP packet, with a fixed source and destination IP/port and padded with
a bunch of 'x' characters, over and over again:

```
# As in the stock TRex profiles; Ether/IP/UDP are Scapy classes re-exported here.
from trex_stl_lib.api import *

packet = STLPktBuilder(pkt =
    Ether() /
    IP(src="16.0.0.1", dst="48.0.0.1") /
    UDP(dport=12, sport=1025) / (10 * 'x')
)
```

#### Method 2: Rampup using trex-loadtest.py

TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional
arguments, does the same thing as the method above. However, this profile optionally takes a few
arguments, called _tunables_ (see the console example below), notably:
* ***size*** - sets the size of the packets: either a number (e.g. 64, the default, or any number
up to a maximum of 9216 bytes), or the string `imix`, which will send a traffic mix consisting of
60b, 590b and 1514b packets.
* ***vm*** - sets the packet source/dest generation. By default (when the flag is `None`), the
same src (16.0.0.1) and dst (48.0.0.1) are set for each packet. When setting the value to
`var1`, the source IP is incremented through `16.0.0.[4-254]`. If the value is set to
`var2`, the source _and_ destination IP are incremented, the destination through `48.0.0.[4-254]`.

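From the TRex interactive console, these tunables are passed with the `-t` flag; a quick sketch
(the same `start` syntax shows up in the outputs further down):

```
tui>start -f stl/bench.py -p 0 -m 100% -t size=imix,vm=var2    # imix, many flows, one direction
tui>start -f stl/bench.py -p 0 1 -m 100% -t size=64,vm=var2    # 64b, many flows, both directions
tui>start -f stl/bench.py -p 0 -m 100%                         # 64b, single flow (the defaults)
```
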
So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed
with this, I will perform loadtests in four modes of interest, from easier to more challenging:
1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform,
as the traffic consists of large (1514 byte) packets, and both source and destination are
different each time, which means lots of multiplexing across receive queues, and relatively few
packets/sec.
1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This
yields what can reasonably be expected from _normal internet use_, just about 3.2Mpps at
10Gbps. This is the most representative test for normal use, but the packet rate is still quite
low due to the (relatively) large packets. Any respectable router should be able to perform well at
an imix profile.
1. ***bench-var2-64b***: still multiple flows, but very small packets, 14.88Mpps at 10Gbps,
often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as
the loadtester will fill the line with small packets (of 64 bytes, the smallest an ethernet
packet is allowed to be). This is a good way to see if the router vendor is actually capable of
what is referred to as _line rate_ forwarding.
1. ***bench***: now restricted to a constant src/dst IP:port tuple, at the same rate of
14.88Mpps at 10Gbps, which means only one Rx queue (and thus, one CPU core) can be used. This is where
single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle
with a test like this, if any given CPU core cannot individually handle full line rate.
I'm looking at you, Tilera!

Further to this list, I can send traffic in one direction only (TRex will emit this from its
`port0` and expect the traffic to be seen back at `port1`), or I can send it ***in both directions***.
The latter doubles the packet rate and bandwidth, to approximately 29.7Mpps.

***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own
transmit queues, so I decided to drive it a bit less close to the cliff and stop at 97% of line rate
(this is 28.3Mpps). That explains why lots of these loadtests top out at that number.

### Results

#### Method 1: Single CPU Thread Saturation

Given the approaches above, for the first method I can "just" saturate the line and see how many packets
emerge through the DUT on the other port, so that's only 3 tests:

Netgate 6100  | Loadtest         | Throughput (pps) | L1 Throughput (bps)  | % of linerate
------------- | ---------------- | ---------------- | -------------------- | -------------
pfSense       | 64b 1-flow       | 622.98 Kpps      | 418.65 Mbps          | 4.19%
Linux         | 64b 1-flow       | 642.71 Kpps      | 431.90 Mbps          | 4.32%
***VPP***     | ***64b 1-flow*** | ***5.01 Mpps***  | ***3.37 Gbps***      | ***33.67%***

***NOTE***: The bandwidth figures here are so-called _L1 throughput_, which means bits on the wire, as opposed
to _L2 throughput_, which means bits in the ethernet frame. This is particularly relevant for 64b loadtests, as
the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-of-frame, and 12 bytes inter-frame gap
[[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also
means that when L1 bandwidth is fully consumed at 10Gbps, the observed L2 bandwidth will be only 7.62Gbps.

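To make that arithmetic concrete, here it is in a few lines of Python; nothing below is measured, it
just restates the frame-overhead math:

```
# L1 vs L2 throughput for 64 byte frames on 10GbE
frame    = 64          # bytes in the ethernet frame (L2, incl. header and FCS)
overhead = 7 + 1 + 12  # preamble + start-of-frame + inter-frame gap = 20 bytes
line     = 10e9        # bits/sec of L1 bandwidth

print(line / ((frame + overhead) * 8))  # ~14.88 Mpps: the line rate quoted above
print(frame / (frame + overhead))       # ~0.762: only 7.62 Gbps of L2 throughput
print(overhead / frame)                 # 0.3125: the 31.25% overhead
```
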
#### Interlude - VPP efficiency

In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons why it's so quick is that
VPP will consume an entire core, and grab ***a set of packets*** from the NIC rather than do work for each
individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first
of these packets will cause the code for the current graph node to be fetched into the CPU's instruction cache,
and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency.

I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and
observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface.
I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate.
Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1
with CPU cores), can be done with the `vppctl show runtime` command and some `awk` magic:

```
$ vppctl clear run && sleep 30 && vppctl show run | \
  awk '$2 ~ /active|polling/ && $4 > 25000 {
         print $0;
         if ($1=="ethernet-input") { packets = $4 };
         if ($1=="dpdk-input") { dpdk_time = $6 };
         total_time += $6
       } END {
         print packets/30, "packets/sec, at", total_time, "cycles/packet,",
           total_time-dpdk_time, "cycles/packet not counting DPDK"
       }'
```

This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen,
omitting the columns that are not super relevant. Ready? Here we go!

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     6.07e2      1.00
TenGigabitEthernet3/0/1-tx         8.61e2      1.00
dpdk-input                         1.51e6      0.00
ethernet-input                     1.22e3      1.00
ip4-input-no-checksum              6.59e2      1.00
ip4-load-balance                   4.50e2      1.00
ip4-lookup                         5.63e2      1.00
ip4-rewrite                        5.83e2      1.00
1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK
```

I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling
the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average
number of vectors per call is pretty much zero, and considering that, most of the CPU time goes into a nice "do
nothing". Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report
both the total cycles spent, that is to say including DPDK, as well as the cycles spent per packet in the
_other active_ nodes. In this case, at 1kpps, VPP is spending 4943 cycles on each packet.

Now, take a look at what happens when I raise the traffic to 1Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     3.80e1      18.57
TenGigabitEthernet3/0/1-tx         1.44e2      18.57
dpdk-input                         1.15e3      .39
ethernet-input                     1.39e2      18.57
ip4-input-no-checksum              8.26e1      18.57
ip4-load-balance                   5.85e1      18.57
ip4-lookup                         7.94e1      18.57
ip4-rewrite                        7.86e1      18.57
981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK
```

Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that
the CPU efficiency went up greatly, from 4943 cycles/packet at 1kpps to 620 cycles/packet at 1Mpps.
That's an order of magnitude improvement!

Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps:

```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps
Graph Node Name                    Clocks      Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output     1.41e1      256.00
TenGigabitEthernet3/0/1-tx         1.23e2      256.00
dpdk-input                         7.95e1      256.00
ethernet-input                     6.74e1      256.00
ip4-input-no-checksum              3.95e1      256.00
ip4-load-balance                   2.54e1      256.00
ip4-lookup                         4.12e1      256.00
ip4-rewrite                        4.78e1      256.00
5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK
```

And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which
point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4943/358)
as efficient under high load as when the CPU is unloaded. Sweet!!

Another really cool thing to do here is to derive the effective clock speed of the Atom CPU. We know it runs at
2200MHz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz,
remarkable precision. Color me impressed :-)

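These numbers also explain the chassis capacity quoted in the executive summary; a quick
back-of-the-envelope check, assuming the three worker threads scale linearly (which the loadtests
below largely confirm):

```
cycles_per_packet = 438     # including dpdk-input, from the 10Mpps run above
clock_hz          = 2.2e9   # Atom C3558 at 2200MHz
workers           = 3       # CPU cores dedicated to VPP worker threads

per_core = clock_hz / cycles_per_packet
print(per_core)             # ~5.02 Mpps per worker, matching the measured 5.01 Mpps
print(per_core * workers)   # ~15.1 Mpps total forwarding capacity
```
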
#### Method 2: Rampup using trex-loadtest.py

For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix,
64b multi-flow and 64b 1-flow), looking at both unidirectional and bidirectional traffic, and performing each
of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40
loadtests!

Loadtest  | pfSense  | Ubuntu   | VPP 1Q  | VPP 2Q  | VPP 3Q  | Details
--------- | -------- | -------- | ------- | ------- | ------- | ----------
***Unidirectional*** |
1514b     | 97%      | 97%      | 97%     | 97%     | ***97%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
imix      | 61%      | 75%      | 96%     | 95%     | ***95%***  | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
64b       | 15%      | 17%      | 33%     | 66%     | ***96%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
64b 1-flow| 4.4%     | 4.7%     | 33%     | 33%     | ***33%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
***Bidirectional*** |
1514b     | 192%     | 193%     | 193%    | 193%    | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
imix      | 63%      | 71%      | 190%    | 190%    | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
64b       | 15%      | 16%      | 61%     | 63%     | ***81%***  | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
64b 1-flow| 8.6%     | 9.0%     | 61%     | ***61%*** | 33% (+)  | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]

A picture says a thousand words - so I invite you to take a look at the interactive graphs linked in the table
above. I'll cherry-pick the one I find most interesting here:

{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}

The graph above is of the unidirectional _64b_ loadtest. Some observations:
* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace) and Ubuntu 20.04.3 (running Linux 5.13, the
orange trace just above it) are roughly equal performers. They handle full-sized (1514 byte) packets just fine,
struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1
CPU core can be used.
* Even at 64b packets, VPP scales linearly: from 33% of line rate with 1Q (the green trace), to 66% with 2Q (the
red trace), and 96% with 3Q (the purple trace, which makes it through to the end).
* With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2
(see the startup.conf sketch below).

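For reference, varying the worker and queue count for the VPP runs boils down to a few lines in
`/etc/vpp/startup.conf`. A minimal sketch -- the PCI addresses and core numbers here are illustrative
assumptions, not the exact configuration of this unit:

```
cpu {
  main-core 0
  corelist-workers 1-3      # three worker threads, one per remaining Atom core
}
dpdk {
  dev 0000:03:00.0 { num-rx-queues 3  num-tx-queues 3 }
  dev 0000:03:00.1 { num-rx-queues 3  num-tx-queues 3 }
}
```
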
## Caveats

The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of
load- and systems integration testing, and to compare their internal benchmarking with my findings.
Other than that, this is not a paid endorsement, and the views in this review are my own.

One quirk I noticed is that when running VPP with 3Q and bidirectional traffic, performance is much worse
than with 2Q or 1Q. This is not a fluke of the loadtest, as I have observed the same strange behavior
on other machines (a Supermicro 5018D-FN8T, for example). I confirmed that each VPP worker thread is used
for each queue, so I would've expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%),
but I get 16.8% instead [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to
understand that better, but for now I'm releasing the data as-is.

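For anyone wanting to double-check the queue/worker placement on their own gear, VPP can show it
directly (commands only, output omitted here):

```
vppctl show threads                   # main + worker threads and their CPU cores
vppctl show interface rx-placement    # which worker polls which Rx queue
vppctl show run                       # per-node, per-thread vectors and clocks (as used above)
```
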
## Appendix

### Generating the data

You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz).
The archive contains the `trex-loadtest.py` script as well, for curious readers!
These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz)
to plot interactive graphs (which I've done for the table above):

```
DEVICE=netgate-6100
ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \
    netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html
done
for i in bench-var2-1514b bench-var2-64b bench imix; do
  ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \
    netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html
done
```

### Notes on pfSense

I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I
simply choose the 'Give me a Shell' option, and take it from there. The router runs `pf` out of
the box, and its ruleset is pretty complex, so I'll just configure some addresses and routes, and disable the
firewall altogether. That seems only fair, as the same tests on Linux and VPP also do not use
a firewall (even though, obviously, both VPP and Linux support firewalls just fine).

```
ifconfig ix0 inet 100.65.1.1/24
ifconfig ix1 inet 100.65.2.1/24
route add -net 16.0.0.0/8 100.65.1.2
route add -net 48.0.0.0/8 100.65.2.2
pfctl -d
```

### Notes on Linux

When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will
thrash around re-routing softirqs between CPU threads; and at the end of the day, I'm trying to saturate
all CPUs anyway, so balancing/moving them around doesn't make any sense. Further, Linux needs static
ARP entries pointing at the TRex ports:

```
sudo systemctl disable irqbalance
sudo systemctl stop irqbalance
sudo systemctl mask irqbalance

sudo ip addr add 100.65.1.1/24 dev enp3s0f0
sudo ip addr add 100.65.2.1/24 dev enp3s0f1
sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0
sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1
sudo ip ro add 16.0.0.0/8 via 100.65.1.2
sudo ip ro add 48.0.0.0/8 via 100.65.2.2
```

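Not shown above: the kernel also needs IPv4 forwarding enabled, assuming the installer didn't already
turn it on:

```
sudo sysctl -w net.ipv4.ip_forward=1
```
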
On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest:
```
root@netgate:/home/pim# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          3          0          0          1
       TIMER:     203788     247280     259544     401401
      NET_TX:       8956       8373       7836       6154
      NET_RX:   22003822   19316480   22526729   19430299
       BLOCK:       2545       3153       2430       1463
    IRQ_POLL:          0          0          0          0
     TASKLET:       5084         60       1830         23
       SCHED:     137647     117482      56371      49112
     HRTIMER:          0          0          0          0
         RCU:      11550       9023       8975       8075
```