---
date: "2024-07-05T12:51:23Z"
title: 'Review: R86S (Jasper Lake - N6005)'
aliases:
- /s/articles/2024/07/05/r86s.html
---
# Introduction
{{< image width="250px" float="right" src="/assets/r86s/r86s-front.png" alt="R86S Front" >}}
I am always interested in finding new hardware that is capable of running VPP. Of course,
standard-issue 19" rack-mountable machines from Dell, HPE or SuperMicro are an obvious choice. They
come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA
or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or
size/cost of these 19" rack-mountable machines can be prohibitive. Sometimes, just having a smaller
form factor can be very useful:
***Enter the GoWin R86S!***
{{< image width="250px" float="right" src="/assets/r86s/r86s-nvme.png" alt="R86S NVME" >}}
I stumbled across this lesser-known build from GoWin, which is an ultra compact but modern design,
featuring three 2.5GbE ethernet ports and optionally two 10GbE or, as I'll show here, two 25GbE
ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot
off of an m.2 NVME -- which immediately makes it an appealing device to put in the field. I
noticed that the height of the machine is just a few millimeters under 1U, which is 1.75"
(44.5mm). That gives me the bright idea to 3D print a bracket so I can rack these and, because they
are very compact -- only 78mm wide -- I can fit four of them in one 1U front, or maybe a Mikrotik
CRS305 breakout switch alongside a few of them. Slick!
{{< image width="250px" float="right" src="/assets/r86s/r86s-ocp.png" alt="R86S OCP" >}}
I picked up two of these _R86S Pro_ and when they arrived, I noticed that their 10GbE is actually
an _Open Compute Project_ (OCP) footprint expansion card, which struck me as clever. It means that I
can replace the Mellanox `CX342A` network card with perhaps something more modern, such as an Intel
`X520-DA2` or Mellanox `MCX542B_ACAN`, which is even dual-25G! So I take to eBay and buy myself a few
OCP expansion boards, which are surprisingly cheap, perhaps because the OCP form factor isn't as
popular as 'normal' PCIe v3.0 cards.
I put a Google photos album online [[here](https://photos.app.goo.gl/gPMAp21FcXFiuNaH7)], in case
you'd like some more detailed shots.
In this article, I'll write about a mixture of hardware, systems engineering (how the hardware like
network cards and motherboard and CPU interact with one another), and VPP performance diagnostics.
I hope that it helps a few wary Internet denizens feel their way around these challenging but
otherwise fascinating technical topics. Ready? Let's go!
# Hardware Specs
{{< image width="250px" float="right" src="/assets/r86s/nics.png" alt="NICs" >}}
For the CHF 314,- I paid apiece for the Intel Pentium N6005 variant, these machines are delightful! They feature:
* Intel Pentium Silver N6005 @ 2.00GHz (4 cores)
* 2x16GB Micron LPDDR4 memory @2933MT/s
* 1x Samsung SSD 980 PRO 1TB NVME
* 3x Intel I226-V 2.5GbE network ports
* 1x OCP v2.0 connector with PCIe v3.0 x4 delivered
* USB-C power supply
* 2x USB3 (one on front, one on side)
* 1x USB2 (on the side)
* 1x MicroSD slot
* 1x MicroHDMI video out
* Wi-Fi 6 AX201 160MHz onboard
To the right I've put the three OCP network interface cards side by side. On the top, the Mellanox
Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at
the bottom, the _classic_ Intel 82599ES (2x10G) card. As I'll demonstrate, despite having the same
form factor, each of these has a unique story to tell, well beyond its rated port speed.
There are quite a few CPU options out there - GoWin sells these with Jasper Lake (Celeron N5105
or Pentium N6005, the one I bought), but also with newer Alder Lake (N100 or N305). Price,
performance and power draw will vary. I looked at a few differences in Passmark, and I think I made
a good trade-off between cost, power and performance. You may of course choose differently!
The R86S form factor is very compact, coming in at 80mm x 120mm x 40mm, and the case is made of
sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the
kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service
the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two
antennae at the back are tri-band, suitable for WiFi 6. There is one fan in the chassis, with a few
cut-outs in the top of the case to let air flow through. The fan is not noisy, but it is definitely
noticeable.
## Compiling VPP on R86S
I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other
with the Mellanox Cx5 network cards. While the Mellanox Cx342A that comes with the R86S does have
DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both
ports as unique PCI devices, causing VPP to crash with duplicate graph node names:
```
vlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx'
Failed to save post-mortem API trace to /tmp/api_post_mortem.794
received signal SIGABRT, PC 0x7f9445aa9e2c
```
The way VPP enumerates DPDK devices is by walking the PCI bus, but because the Connect-X3 has
two ports behind the same PCI address, it tries to create two interfaces with the same name, which
fails. It's pretty easily fixable with a small [[patch](/assets/r86s/vpp-cx3.patch)]. Off I go, to
compile VPP (version `24.10-rc0~88-ge3469369dd`) with Mellanox DPDK support, to get the best
side-by-side comparison between the Cx3 and X520 cards, which need DPDK, and the Cx5 card, which can
optionally also use VPP's RDMA driver. They will _all_ be using DPDK in my tests.
I'm not out of the woods yet, because VPP throws an error when enumerating and attaching the
Mellanox Cx342. I read the DPDK documentation for this poll mode driver
[[ref](https://doc.dpdk.org/guides/nics/mlx4.html)] and find that when using DPDK applications, the
`mlx4_core` driver in the kernel has to be initialized with a specific flag, like so:
```
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1"
```
And because I'm using `iommu`, the correct driver to load for Cx3 is `vfio_pci`, so I put that in
`/etc/modules`, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the
way, I am now ready to take the R86S out for a spin and see how much this little machine is capable
of forwarding as a router.
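Concretely, on Debian Bookworm that step amounts to something like the following. The file locations
are the standard Debian ones and the kernel parameters are the same as shown above, but treat this as
a sketch of my setup rather than gospel:
```
# Add the kernel parameters shown above to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, then regenerate the GRUB configuration:
sudo update-grub

# Make sure vfio_pci is loaded at boot, and rebuild the initramfs:
echo vfio_pci | sudo tee -a /etc/modules
sudo update-initramfs -u -k all

sudo reboot
```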
### Power: Idle and Under Load
I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by the OS and
controlplane, leaving 3 worker threads for VPP. The Pentium Silver N6005 comes with 32kB of L1 cache
per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It's not much, but then again
the TDP is a shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the
entire machine idles at 12 Watts. Under full load, the machines with the Mellanox Cx3 and the Intel
X520-DA2 both sip 17 Watts all-up, while the one with the Mellanox Cx5 slurps 20 Watts. Neat!
## Loadtest Results
{{< image width="400px" float="right" src="/assets/r86s/loadtest.png" alt="Loadtest" >}}
For each network interface I will do a bunch of loadtests, to show different aspects of the setup.
First, I'll do a bunch of unidirectional tests, where traffic goes into one port and exits another.
I will do this with large packets (1514b); with small packets (64b) and many flows, which allows me
to use multiple hardware receive queues assigned to individual worker threads; and with small packets
in only one flow, limiting VPP to one RX queue and consequently only one CPU thread. Because I
think it's hella cool, I will also loadtest MPLS label switching (e.g. an MPLS frame with label '16' on
ingress, forwarded with a swapped label '17' on egress). In general, MPLS lookups can be a bit
faster as they are (constant time) hashtable lookups, while IPv4 longest-prefix-match lookups use a
trie. MPLS won't be significantly faster than IPv4 in these tests, though, because the FIB is tiny
with only a handful of entries.
Second, I'll do the same loadtests but in both directions, which means traffic is both entering NIC0
and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests,
again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip
has to do more work to maintain its RX queues *and* its TX queues simultaneously. As I'll
demonstrate, this tends to matter quite a bit on consumer hardware.
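For reference, this is roughly what those runs look like from the Cisco T-Rex console. I'm assuming
the stock `stl/bench.py` profile that ships with T-Rex here, and the tunable names may differ between
T-Rex versions, so consider this a sketch of the shape of the tests rather than an exact recipe:
```
trex> start -f stl/bench.py -m 100% -p 0 -t size=1514          # large packets, one direction
trex> start -f stl/bench.py -m 100% -p 0 -t size=64,vm=var2    # 64b, many flows
trex> start -f stl/bench.py -m 100% -p 0 -t size=64            # 64b, single flow
trex> start -f stl/bench.py -m 100% -p 0 1 -t size=64,vm=var2  # 64b, both directions
```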
### Intel i226-V (2.5GbE)
This is a 2.5G network interface from Intel's _Foxville_ family, released in Q2 2022 with a ten-year
expected availability, which makes it currently a very good choice. It is a consumer/client chip,
which means I cannot expect super performance from it. In this machine, the three RJ45 ports are
connected to PCI slots 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (which means they are PCIe
v2.0), and each takes one x1 PCIe lane to the CPU. I leave the first port as management, and give the
second and third to VPP like so:
```
dpdk {
  dev 0000:02:00.0 { name e0 }
  dev 0000:03:00.0 { name e1 }
  no-multi-seg
  decimal-interface-names
  uio-driver vfio-pci
}
```
The logical configuration then becomes:
```
set int state e0 up
set int state e1 up
set int ip address e0 100.64.1.1/30
set int ip address e1 100.64.2.1/30
ip route add 16.0.0.0/24 via 100.64.1.2
ip route add 48.0.0.0/24 via 100.64.2.2
ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70
ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71
mpls table add 0
set interface mpls e0 enable
set interface mpls e1 enable
mpls local-label add 16 eos via 100.64.2.2 e1
mpls local-label add 17 eos via 100.64.1.2 e0
```
In the first block, I'll bring up interfaces `e0` and `e1`, give them an IPv4 address in a /30
transit net, and set a route to the other side. I'll route packets destined to 16.0.0.0/24 to the
Cisco T-Rex loadtester at 100.64.1.2, and I'll route packets for 48.0.0.0/24 to the T-Rex at
100.64.2.2. To avoid the need to ARP for T-Rex, I'll set some static ARP entries to the loadtester's
MAC addresses.
In the second block, I'll create the MPLS table, enable MPLS on the two interfaces, and add two FIB
entries. If VPP receives an MPLS packet with label 16, it'll forward it on to Cisco T-Rex on port
`e1`, and if it receives a packet with label 17, it'll forward it to T-Rex on port `e0`.
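Before letting T-Rex loose, a quick sanity check in the VPP CLI confirms that the routes, neighbors
and MPLS entries made it into the FIB:
```
vpp# show ip fib 16.0.0.0/24
vpp# show ip fib 48.0.0.0/24
vpp# show ip neighbors
vpp# show mpls fib
```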
Without further ado, here are the results of the i226-V loadtest:
| ***Intel i226-V***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| Unidirectional 1514b | 2.44Gbps | 202kpps | 99.4% |
| Unidirectional 64b Multi | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b Single | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b MPLS | 1.57Gbps | 3.27Mpps | 87.9% |
| Bidirectional 1514b | 4.84Gbps | 404kpps | 99.4% |
| Bidirectional 64b Multi | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b Single | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b MPLS | 2.43Gbps | 5.07Mpps | 68.2% |
First response: very respectable!
#### Important Notes
**1. L1 vs L2** \
There are a few observations I want to make, as these numbers can be confusing. First off, VPP, when
given large packets, can easily sustain almost exactly (!) the line rate of 2.5GbE. There's always a
debate about these numbers, so let me offer some theoretical background --
1. The L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6
bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). T-Rex shows us this
number as `Tx bps L2`.
1. But on the wire, the PHY has to additionally send a _preamble_ (7 bytes), a _start frame
delimiter_ (1 byte), and at the end, an _interpacket gap_ (12 bytes), which is 20 bytes of
overhead. This means that the total size on the wire will be **1534 bytes**. It shows us this
number as `Tx bps L1`.
1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we
can send at most 2'500'000'000 / 12272 = **203715 packets per second**. Regardless of L1 or L2,
this number is always `Tx pps`.
1. The smallest (L2) Ethernet frame we're allowed to send is 64 bytes, and anything shorter than
this is called a _Runt_. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this
means **3.72Mpps** is the theoretical maximum.
When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it
only shows us the `Rx bps`, which is the **L2 bits/sec**, corresponding to the sending port's `Tx
bps L2`. When I describe the percentage of Line-Rate, I calculate it with what physically fits on
the wire, i.e. the **L1 bits/sec**, because that makes most sense to me.
When sending small 64b packets, the difference is significant: taking the above _Unidirectional 64b
Single_ as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M\*64\*8 = 1.679Gbit
of L2 traffic, but a bandwidth of 3.28M\*(64+20)\*8 = 2.204Gbit of L1 traffic, which is how I
determine that it is 88.1% of Line-Rate.
**2. One RX queue** \
A less pedantic observation is that there is no difference between the _Multi_ and _Single_ flow
loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread.
I did do a few loadtests with multiple receive queues, but it did not matter for performance. When
handling this 3.28Mpps of load, I can see that VPP itself is not saturated. Most of the time it's
just sitting there waiting for DPDK to give it work, which manifests as a relatively low
vectors/call:
```
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87
vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
dpdk-input polling 61933 2846586 0 1.28e2 45.96
ethernet-input active 61733 2846586 0 1.71e2 46.11
ip4-input-no-checksum active 61733 2846586 0 6.54e1 46.11
ip4-load-balance active 61733 2846586 0 4.70e1 46.11
ip4-lookup active 61733 2846586 0 7.50e1 46.11
ip4-rewrite active 61733 2846586 0 7.23e1 46.11
e1-output active 61733 2846586 0 2.53e1 46.11
e1-tx active 61733 2846586 0 1.38e2 46.11
```
By the way, the other numbers here are fascinating as well. Take a look at them:
* **Calls**: How often has VPP executed this graph node.
* **Vectors**: How many packets (which are internally called vectors) have been handled.
* **Vectors/Call**: Every time VPP executes the graph node, on average how many packets are done
at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00.
* **Clocks**: How many CPU cycles, on average, did each packet spend in each graph node.
Interestingly, summing this up across the nodes and multiplying by the packet rate gets very close
to the total CPU clock cycles available (on this machine, 2.4GHz).
Zooming in on the **clocks** number a bit more: every time a packet was handled, roughly 594 CPU
cycles were spent in VPP's directed graph. An additional 128 CPU cycles were spent asking DPDK for
work. Summing it all up, 3.28M\*(594+128) = 2'369'170'800, which is eerily close to the 2.4GHz I
mentioned above. I love it when the math checks out!!
By the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent
in `dpdk-input` (and other polling nodes like `unix-epoll-input`) just go up to consume the whole
core. I explain that in a bit more detail below.
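If you want to sample these counters yourself, note that the `Calls` and `Vectors` columns are
cumulative since the last reset, so I usually clear them right before taking a reading:
```
vpp# clear runtime
... let the loadtest run for ten seconds or so ...
vpp# show runtime
```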
**3. Uni- vs Bidirectional** \
I noticed a non-linear response between loadtests in one direction versus both directions. With large
packets, it did not matter: both directions saturated the line nearly perfectly (202kpps in one
direction, and 404kpps in both directions). However, with the smaller packets, some contention became
clear. In only one direction, IPv4 and MPLS forwarding ran at roughly 3.28Mpps; but in both
directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it's
interesting to see that these i226-V chips do seem to care whether they are only receiving, only
transmitting, or performing both receiving *and* transmitting at the same time.
### Intel X520 (10GbE)
{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device
(5.0GT/s) and to be able to run both ports at full rate, it needs eight lanes of PCI connectivity.
However, a quick inspection using `dmesg` shows me that there are only 4 lanes brought to the OCP
connector:
```
ixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link
at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link)
ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000
ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38
ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
```
That's a bummer! There are two TenGig ports on this OCP card, but the chip is a PCIe v2.0 device,
which means the PCI encoding is 8b/10b: each lane can deliver about 80% of its 5.0GT/s, and 80% of
the 20GT/s across four lanes is 16.0Gbit. By the way, when PCIe v3.0 was released, not only did the
transfer speed go up to 8.0GT/s per lane, the encoding also changed to 128b/130b, which lowers the
overhead from a whopping 20% to only about 1.5%. It's not a bad investment of time to read up on PCI
Express standards on [[Wikipedia](https://en.wikipedia.org/wiki/PCI_Express)], as PCIe limitations
and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my
buddy Sander also noted during my NLNOG talk last year.
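As a quick back-of-the-envelope, here's how the usable bandwidth works out for the link widths and
encodings that show up in this article:
```
PCIe v2.0 x4:  4 lanes x 5.0 GT/s x   8b/10b  = 16.0 Gbit/s usable
PCIe v2.0 x8:  8 lanes x 5.0 GT/s x   8b/10b  = 32.0 Gbit/s usable
PCIe v3.0 x4:  4 lanes x 8.0 GT/s x 128b/130b = 31.5 Gbit/s usable
```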
#### Intel X520: Loadtest Results
Now that I've shown a few of these runtime statistics, I think it's good to review three pertinent graphs.
I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2
adapter. I'll run the same eight loadtests:
**{1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}**
Earlier, I showed the output of `show runtime` in the VPP debug CLI. These numbers are
also exported by a Prometheus exporter; I wrote about that in this [[article]({{< ref "2023-04-09-vpp-stats" >}})]. In Grafana, I can draw these timeseries as graphs, and they show me a lot
about where VPP is spending its time. Each _node_ in the directed graph counts how many vectors
(packets) it has seen, and how many CPU cycles it has spent doing its work.
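To give an idea of what the Grafana panels below compute: they boil down to queries along these
lines. The metric names here are purely illustrative (they depend on which exporter you run); the
shape of the query is the point:
```
# packets/sec per graph node (the 'vectors' graph):
sum by (node) (rate(vpp_runtime_vectors[60s]))

# average CPU cycles per packet per graph node (the 'clocks' graph):
sum by (node) (rate(vpp_runtime_clocks[60s]))
  / sum by (node) (rate(vpp_runtime_vectors[60s]))
```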
{{< image src="/assets/r86s/grafana-vectors.png" alt="Grafana Vectors" >}}
In VPP, a graph of _vectors/sec_ shows how many packets per second the router is forwarding. The
graph above is on a logarithmic scale, and I've annotated each of the eight loadtests in orange. The
first block of four are the ***U***_nidirectional_ tests and of course, higher values are better.
I notice that some of these loadtests ramp up until a certain point, after which they become a <span
style='color:orange;font-weight:bold;'>flatline</span>, which I drew orange arrows for. The first
time this clearly happens is in the ***U3*** loadtest. It makes sense to me, because having one flow
implies only one worker thread, whereas in the ***U2*** loadtest the system can make use of multiple
receive queues and therefore multiple worker threads. It stands to reason that ***U2*** has a
slightly better performance than ***U3***.
The fourth test, the _MPLS_ loadtest, forwards the same packets, received with label 16 and sent out
on another interface with label 17. They are therefore also single flow, and this explains why the
***U4*** loadtest looks very similar to the ***U3*** one. Some NICs can hash MPLS traffic to
multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka
82599ES cannot do that.
The second block of four are the ***B***_idirectional_ tests. Similar to the tests I did with the
i226-V 2.5GbE NICs, here each of the network cards has to both receive and send traffic.
It is with this graph that I can determine the overall throughput in packets/sec
of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex
loadtester output JSON. Here they are, for the Intel X520-DA2:
| ***Intel 82599ES***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.77Gbps | 809kpps | 99.2% |
| U2: Unidirectional 64b Multi | 6.48Gbps | 13.4Mpps | 90.1% |
| U3: Unidirectional 64b Single | 3.73Gbps | 7.77Mpps | 52.2% |
| U4: Unidirectional 64b MPLS | 3.32Gbps | 6.91Mpps | 46.4% |
| B1: Bidirectional 1514b | 12.9Gbps | 1.07Mpps | 65.6% |
| B2: Bidirectional 64b Multi | 6.08Gbps | 12.7Mpps | 42.7% |
| B3: Bidirectional 64b Single | 6.25Gbps | 13.0Mpps | 43.7% |
| B4: Bidirectional 64b MPLS | 3.26Gbps | 6.79Mpps | 22.8% |
A few further observations:
1. ***U1***'s loadtest shows that the machine can sustain 10Gbps in one direction, while ***B1***
shows that bidirectional loadtests are not yielding twice as much throughput. This is very
likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the
OCP NIC supports PCIe 5.0GT/s x8 (32Gbps).
1. ***U3***'s loadtest shows that one single CPU can do 7.77Mpps max, if it's the only CPU that is
doing work. This is likely because if it's the only thread doing work, it gets to use the
entire L2/L3 cache for itself.
1. ***U2***'s test shows that when multiple workers perform work, the throughput raises to
13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is
because the threads now need to share the CPU's modest L2/L3 cache.
1. ***B3***'s loadtest shows that two CPU threads can each do 6.50Mpps (for a total of
13.0Mpps), which I think is likely because each NIC now has to receive _and_ transmit packets.
If you're reading this and think you have an alternative explanation, do let me know!
### Mellanox Cx3 (10GbE)
When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA)
for a _list_ of packets, rather than one individual packet. It then brings these packets, called
_vectors_, through a directed acyclic graph inside of VPP. Each graph node does something specific to
the packets, for example in `ethernet-input`, the node checks what ethernet type each packet is (ARP,
IPv4, IPv6, MPLS, ...), and hands them off to the correct next node, such as `ip4-input` or `mpls-input`.
If VPP is idle, there may be only one or two packets in the list, which means every time the packets
go into a new node, a new chunk of code has to be loaded from working memory into the CPU's
instruction cache. Conversely, if there are many packets in the list, only the first packet may need
to pull things into the i-cache, the second through Nth packet will become cache hits and execute
_much_ faster. Moreover, some nodes in VPP make use of processor optimizations like _SIMD_ (single
instruction, multiple data), to save on clock cycles if the same operation needs to be executed
multiple times.
{{< image src="/assets/r86s/grafana-clocks.png" alt="Grafana Clocks" >}}
This graph shows the average CPU cycles per packet for each node. In the first three loadtests
(***U1***, ***U2*** and ***U3***), you can see four lines representing the VPP nodes `ip4-input`
`ip4-lookup`, `ip4-load-balance` and `ip4-rewrite`. In the fourth loadtest ***U4***, you can see
only three nodes: `mpls-input`, `mpls-lookup`, and `ip4-mpls-label-disposition-pipe` (where the MPLS
label '16' is swapped for outgoing label '17').
It's clear to me that when VPP does not have many packets/sec to route (i.e. the ***U1*** loadtest),
the cost _per packet_ is actually quite high at around 200 CPU cycles per packet per node. But if I
slam the VPP instance with lots of packets/sec (i.e. the ***U3*** loadtest), VPP gets _much_ more
efficient at what it does. What used to take 200+ cycles per packet now only takes between 34-52
cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?!
And with that, the Mellanox Cx3 loadtest completes, and the results are in:
| ***Mellanox MCX342A-XCCN***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.73Gbps | 805kpps | 99.7% |
| U2: Unidirectional 64b Multi | 1.11Gbps | 2.30Mpps | 15.5% |
| U3: Unidirectional 64b Single | 1.10Gbps | 2.27Mpps | 15.3% |
| U4: Unidirectional 64b MPLS | 1.10Gbps | 2.27Mpps | 15.3% |
| B1: Bidirectional 1514b | 18.7Gbps | 1.53Mpps | 94.9% |
| B2: Bidirectional 64b Multi | 1.54Gbps | 2.29Mpps | 7.69% |
| B3: Bidirectional 64b Single | 1.54Gbps | 2.29Mpps | 7.69% |
| B4: Bidirectional 64b MPLS | 1.54Gbps | 2.29Mpps | 7.69% |
Here's something that I find strange, though. VPP is clearly not saturated by these 64b loadtests. I
know this because, in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a
bidirectional test, yet with this Mellanox Cx3 card, no matter whether I test one direction or both
directions, the packet rate tops out at only 2.3Mpps -- almost an order of magnitude lower.
Looking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from
Port 5/0/1) are not very busy at all. If a VPP worker thread is saturated, this typically shows as
a vectors/call of 256.00 and 100% of CPU cycles consumed. But here, that's not the case at all, and
most time is spent in DPDK waiting for traffic:
```
Thread 1 vpp_wk_0 (lcore 1)
Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15
vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
FiftySixGigabitEthernet5/0/1-o active 15949560 35929200 0 8.39e1 2.25
FiftySixGigabitEthernet5/0/1-t active 15949560 35929200 0 2.59e2 2.25
dpdk-input polling 36250611 35929200 0 6.55e2 .99
ethernet-input active 15949560 35929200 0 2.69e2 2.25
ip4-input-no-checksum active 15949560 35929200 0 1.01e2 2.25
ip4-load-balance active 15949560 35929200 0 7.64e1 2.25
ip4-lookup active 15949560 35929200 0 9.26e1 2.25
ip4-rewrite active 15949560 35929200 0 9.28e1 2.25
unix-epoll-input polling 35367 0 0 1.29e3 0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38
vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
FiftySixGigabitEthernet5/0/0-o active 14845221 35913927 0 8.66e1 2.42
FiftySixGigabitEthernet5/0/0-t active 14845221 35913927 0 2.72e2 2.42
dpdk-input polling 23114538 35913927 0 6.99e2 1.55
ethernet-input active 14845221 35913927 0 2.65e2 2.42
ip4-input-no-checksum active 14845221 35913927 0 9.73e1 2.42
ip4-load-balance active 14845220 35913923 0 7.17e1 2.42
ip4-lookup active 14845221 35913927 0 9.03e1 2.42
ip4-rewrite active 14845221 35913927 0 8.97e1 2.42
unix-epoll-input polling 22551 0 0 1.37e3 0.00
```
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it just not
like small packets? I've read online that Mellanox cards do some form of message compression on the
PCI bus, which perhaps could be turned off. I don't know, but I don't like it!
### Mellanox Cx5 (25GbE)
VPP has a few _polling_ nodes, which are pieces of code that execute back-to-back in a tight
execution loop. A classic example of a _polling_ node is a _Poll Mode Driver_ from DPDK: this will
ask the network cards if they have any packets and, if so, marshal them through the directed graph
of VPP. As soon as that's done, the node will immediately ask again. If there is no work to do, this
turns into a tight loop with DPDK continuously asking for work. There is however another, lesser
known, _polling_ node: `unix-epoll-input`. This node services a local pool of file descriptors, like
the _Linux Control Plane_ netlink socket for example, or the clients attached to the Statistics
segment, CLI or API. You can see the open files with `show unix files`.
This design explains why the CPU load of a typical DPDK application is 100% on each worker thread. As
an aside, you can ask the PMD to start off in _interrupt_ mode, and only after a certain load switch
seamlessly to _polling_ mode. Take a look at `set interface rx-mode` for how to change from _polling_
to _interrupt_ or _adaptive_ mode. For performance reasons, I always leave the nodes in _polling_
mode (the default in VPP).
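For completeness, this is what toggling the mode on one of my interfaces looks like in the VPP CLI,
and `show interface rx-placement` reveals which receive queue ended up on which worker thread:
```
vpp# set interface rx-mode e0 interrupt
vpp# set interface rx-mode e0 adaptive
vpp# set interface rx-mode e0 polling
vpp# show interface rx-placement
```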
{{< image src="/assets/r86s/grafana-cpu.png" alt="Grafana CPU" >}}
The stats segment shows how many clock cycles are being spent in each call of each node. It also
knows how often nodes are called. Considering the `unix-epoll-input` and `dpdk-input` nodes perform
what is essentially a tight loop, the CPU usage should always add up to 100%. I found that one
cool way to show how busy a VPP instance really is, is to look across all CPU threads and sort
by the fraction of time spent in each node:
* ***Input Nodes***: are those which handle the receive path from DPDK and into the directed graph
for routing -- for example `ethernet-input`, then `ip4-input` through to `ip4-lookup` and finally
`ip4-rewrite`. This is where VPP usually spends most of its CPU cycles.
* ***Output Nodes***: are those which handle the transmit path into DPDK. You'll see these are
nodes whose name ends in `-output` or `-tx`. You can also see that in ***U2***, there are only
two nodes consuming CPU, while in ***B2*** there are four nodes (because two interfaces are
transmitting!)
* ***epoll***: the _polling_ node called `unix-epoll-input` depicted in <span
style='color:brown;font-weight:bold'>brown</span> in this graph.
* ***dpdk***: the _polling_ node called `dpdk-input` depicted in <span
style='color:green;font-weight:bold'>green</span> in this graph.
If there is no work to do, as was the case at around 20:30 in the graph above, the _dpdk_ and
_epoll_ nodes are the only two that are consuming CPU. If there's lots of work to do, as was the
case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest
between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving
the _dpdk_ and _epoll_ nodes until an equilibrium is achieved. This is how I know the VPP process
is the bottleneck and not, for example, the PCI bus.
I let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table
for the Mellanox Cx5:
| ***Mellanox MCX542_ACAT***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 24.2Gbps | 2.01Mpps | 98.6% |
| U2: Unidirectional 64b Multi | 7.43Gbps | 15.5Mpps | 41.6% |
| U3: Unidirectional 64b Single | 3.52Gbps | 7.34Mpps | 19.7% |
| U4: Unidirectional 64b MPLS | 7.34Gbps | 15.3Mpps | 46.4% |
| B1: Bidirectional 1514b | 24.9Gbps | 2.06Mpps | 50.4% |
| B2: Bidirectional 64b Multi | 6.58Gbps | 13.7Mpps | 18.4% |
| B3: Bidirectional 64b Single | 3.15Gbps | 6.55Mpps | 8.81% |
| B4: Bidirectional 64b MPLS | 6.55Gbps | 13.6Mpps | 18.3% |
Some observations:
1. This Mellanox Cx5 runs quite a bit hotter than the other two cards. It's a PCIe v3.0 device, which means
that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case
you're wondering, this is 128b/130b encoding on 8.0GT/s x4).
1. It easily saturates 25Gbit in one direction with big packets in ***U1***, but as soon as smaller
packets are offered, each worker thread tops out at 7.34Mpps or so in ***U2***.
1. When testing in both directions, each thread can do about 6.55Mpps or so in ***B2***. Similar to
the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple
threads), and RX/TX simultaneously (when doing bidirectional tests).
1. MPLS is a lot faster -- nearly double, thanks to the use of multiple threads. I think this is
because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner
payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not.
## Summary and closing thoughts
There's a lot to say about these OCP cards. The Intel is cheap, the Mellanox Cx3 is a bit quirky
with its VPP enumeration, and the Mellanox Cx5 is a bit more expensive (and draws a fair bit more
power, coming in at 20W all-up) but does 25Gbit reasonably well, so it's pretty difficult to make a
solid recommendation. What I find interesting is the very low limit in packets/sec on 64b packets
coming from the Cx3, while at the same time there seems to be an added benefit in MPLS hashing that
the other two cards do not have.
All things considered, I think I would recommend the Intel X520-DA2 (based on the _Niantic_ chip,
Intel 82599ES, with the total machine coming in at 17W). It seems to pair best with the available CPU
on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)
Here's a few files I gathered along the way, in case they are useful:
* [[LSCPU](/assets/r86s/lscpu.txt)] - [[Likwid Topology](/assets/r86s/likwid-topology.txt)] -
[[DMI Decode](/assets/r86s/dmidecode.txt)] - [[LSBLK](/assets/r86s/lsblk.txt)]
* Mellanox Cx341: [[dmesg](/assets/r86s/dmesg-cx3.txt)] - [[LSPCI](/assets/r86s/lspci-cx3.txt)] -
[[LSHW](/assets/r86s/lshw-cx3.txt)] - [[VPP Patch](/assets/r86s/vpp-cx3.patch)]
* Mellanox Cx542: [[dmesg](/assets/r86s/dmesg-cx5.txt)] - [[LSPCI](/assets/r86s/lspci-cx5.txt)] -
[[LSHW](/assets/r86s/lshw-cx5.txt)]
* Intel X520-DA2: [[dmesg](/assets/r86s/dmesg-x520.txt)] - [[LSPCI](/assets/r86s/lspci-x520.txt)] -
[[LSHW](/assets/r86s/lshw-x520.txt)]
* VPP Configs: [[startup.conf](/assets/r86s/vpp/startup.conf)] - [[L2 Config](/assets/r86s/vpp/config/l2.vpp)] -
[[L3 Config](/assets/r86s/vpp/config/l3.vpp)] - [[MPLS Config](/assets/r86s/vpp/config/mpls.vpp)]