---
date: "2024-07-05T12:51:23Z"
title: 'Review: R86S (Jasper Lake - N6005)'
aliases:
- /s/articles/2024/07/05/r86s.html
---

# Introduction

{{< image width="250px" float="right" src="/assets/r86s/r86s-front.png" alt="R86S Front" >}}

I am always interested in finding new hardware that is capable of running VPP. Of course, a standard
issue 19" rack mountable machine from Dell, HPE or SuperMicro is an obvious choice. They
come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA
or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or
size/cost of these 19" rack mountable machines can be prohibitive. Sometimes, just having a smaller
form factor can be very useful:

***Enter the GoWin R86S!***

{{< image width="250px" float="right" src="/assets/r86s/r86s-nvme.png" alt="R86S NVME" >}}

I stumbled across this lesser known build from GoWin, which is an ultra compact but modern design,
featuring three 2.5GbE ethernet ports and optionally two 10GbE, or as I'll show here, two 25GbE
ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot
off of an m.2 NVME -- which makes it immediately an appealing device to put in the field. I
noticed that the height of the machine is just a few millimeters short of 1U, which is 1.75"
(44.5mm). This gives me the bright idea to 3D print a bracket so that I can rack them, and because
they are very compact -- only 78mm wide -- I can fit four of them in one 1U front, or
maybe a Mikrotik CRS305 breakout switch. Slick!

{{< image width="250px" float="right" src="/assets/r86s/r86s-ocp.png" alt="R86S OCP" >}}

I picked up two of these _R86S Pro_ and when they arrived, I noticed that their 10GbE is actually
an _Open Compute Project_ (OCP) footprint expansion card, which struck me as clever. It means that I
can replace the Mellanox `CX342A` network card with perhaps something more modern, such as an Intel
`X520-DA2` or Mellanox `MCX542B_ACAN`, which is even dual-25G! So I take to ebay and buy myself a few
OCP expansion boards, which are surprisingly cheap, perhaps because the OCP form factor isn't as
popular as 'normal' PCIe v3.0 cards.

I put a Google Photos album online [[here](https://photos.app.goo.gl/gPMAp21FcXFiuNaH7)], in case
you'd like some more detailed shots.

In this article, I'll write about a mixture of hardware, systems engineering (how the hardware like
network cards and motherboard and CPU interact with one another), and VPP performance diagnostics.
I hope that it helps a few wary Internet denizens feel their way around these challenging but
otherwise fascinating technical topics. Ready? Let's go!

# Hardware Specs

{{< image width="250px" float="right" src="/assets/r86s/nics.png" alt="NICs" >}}

For the CHF 314,- I paid for each of these Intel Pentium N6005 machines, they are delightful! They feature:

* Intel Pentium Silver N6005 @ 2.00GHz (4 cores)
* 2x16GB Micron LPDDR4 memory @2933MT/s
* 1x Samsung SSD 980 PRO 1TB NVME
* 3x Intel I226-V 2.5GbE network ports
* 1x OCP v2.0 connector with PCIe v3.0 x4 delivered
* USB-C power supply
* 2x USB3 (one on front, one on side)
* 1x USB2 (on the side)
* 1x MicroSD slot
* 1x MicroHDMI video out
* Wi-Fi 6 AX201 160MHz onboard

To the right I've put the three OCP network interface cards side by side. On the top, the Mellanox
Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at
the bottom, the _classic_ Intel 82599ES (2x10G) card. As I'll demonstrate, despite having the same
form factor, each of these has a unique story to tell, well beyond their rated portspeed.

There are quite a few CPU options out there - GoWin sells these units with Jasper Lake (Celeron N5105
or Pentium N6005, the one I bought), but also with the newer Alder Lake (N100 or N305). Price,
performance and power draw will vary. I looked at a few differences in Passmark, and I think I made
a good trade-off between cost, power and performance. You may of course choose differently!

The R86S form factor is very compact, coming in at 80mm x 120mm x 40mm, and the case is made of
sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the
kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service
the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two
antennae at the back are tri-band, suitable for WiFi 6. There is one fan included in the chassis,
with a few cut-outs in the top of the case to let air flow through. The fan is not
noisy, but definitely noticeable.

## Compiling VPP on R86S

I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other
with the Mellanox Cx5 network cards. While the Mellanox Cx342A that comes with the R86S does have
DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both
ports as unique PCI devices, causing VPP to crash with duplicate graph node names:

```
vlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx'
Failed to save post-mortem API trace to /tmp/api_post_mortem.794
received signal SIGABRT, PC 0x7f9445aa9e2c
```

The way VPP enumerates DPDK devices is by walking the PCI bus, but considering the ConnectX-3 has
two ports behind the same PCI address, it'll try to create two interfaces, which fails. It's pretty
easily fixable with a small [[patch](/assets/r86s/vpp-cx3.patch)]. Off I go, to compile VPP (version
`24.10-rc0~88-ge3469369dd`) with Mellanox DPDK support, to get the best side by side comparison: the
Cx3 and X520 cards need DPDK, while the Cx5 card could optionally also use VPP's native RDMA driver.
They will _all_ be using DPDK in my tests.
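
For those following along at home, a rough sketch of that build on a Debian Bookworm machine; the
exact dependency set and packaging steps may differ a bit from what I did, but the VPP Makefile
targets shown here are the standard ones:

```
git clone https://gerrit.fd.io/r/vpp && cd vpp
git apply /path/to/vpp-cx3.patch   # the ConnectX-3 dual-port workaround linked above
make install-dep                   # build dependencies, including rdma-core/libibverbs for Mellanox
make build-release
make pkg-deb                       # produces .deb packages under build-root/
```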

I'm not out of the woods yet, because VPP throws an error when enumerating and attaching the
Mellanox Cx342. I read the DPDK documentation for this poll mode driver
[[ref](https://doc.dpdk.org/guides/nics/mlx4.html)] and find that when using DPDK applications, the
`mlx4_core` driver in the kernel has to be initialized with a specific flag, like so:

```
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1"
```

And because I'm using `iommu`, the correct driver to load for the Cx3 is `vfio_pci`, so I put that in
`/etc/modules`, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the
way, I am now ready to take the R86S out for a spin and see how much this little machine is capable
of forwarding as a router.
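
On Debian, those last steps boil down to something like the following; a minimal sketch, assuming
the GRUB change above has already been made in `/etc/default/grub`:

```
echo vfio_pci >> /etc/modules   # load the vfio-pci driver at boot
update-initramfs -u             # rebuild the initrd so the module is included
update-grub                     # pick up the new GRUB_CMDLINE_LINUX_DEFAULT
reboot
```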

### Power: Idle and Under Load

I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by the OS and
controlplane, leaving 3 worker threads for VPP. The Pentium Silver N6005 comes with 32kB of L1
per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It's not much, but then again
the TDP is a shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the
entire machine idles at 12 Watts. Under full load, the Mellanox Cx3 and Intel
X520-DA2 both sip 17 Watts of power and the Mellanox Cx5 slurps 20 Watts of power all-up. Neat!

## Loadtest Results

{{< image width="400px" float="right" src="/assets/r86s/loadtest.png" alt="Loadtest" >}}

For each network interface I will do a bunch of loadtests, to show different aspects of the setup.
First, I'll do a bunch of unidirectional tests, where traffic goes into one port and exits another.
I will do this with either large packets (1514b), or small packets (64b) with many flows, which allows
me to use multiple hardware receive queues assigned to individual worker threads, or small packets with
only one flow, limiting VPP to only one RX queue and consequently only one CPU thread. Because I
think it's hella cool, I will also loadtest MPLS label switching (e.g. an MPLS frame with label '16' on
ingress, forwarded with a swapped label '17' on egress). In general, MPLS lookups can be a bit
faster as they are (constant time) hashtable lookups, while IPv4 longest prefix match lookups use a
trie. MPLS won't be significantly faster than IPv4 in these tests, because the FIB is tiny with
only a handful of entries.

Second, I'll do the same loadtests but in both directions, which means traffic is both entering NIC0
and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests,
again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip
has to do more work to maintain its RX queues *and* its TX queues simultaneously. As I'll
demonstrate, this tends to matter quite a bit on consumer hardware.
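
To give an idea of what these tests look like on the loadtester side, here's a minimal sketch of a
Cisco T-Rex stateless profile for the small-packet test, assuming the standard `trex_stl_lib` Python
API; the addresses match the routes I configure below, but the actual profiles I used may differ in
detail:

```
# One continuous stream of 64 byte frames from 16.0.0.0/24 towards 48.0.0.0/24.
# With op="inc" on the source address the NIC sees many flows (multi-flow);
# dropping the field engine (vm) turns this into the single-flow variant.
from trex_stl_lib.api import *

class R86SProfile:
    def get_streams(self, direction=0, **kwargs):
        base = Ether() / IP(src="16.0.0.1", dst="48.0.0.1") / UDP(sport=1025, dport=53)
        pad = max(0, 60 - len(base)) * 'x'   # pad to 64b on the wire, including the 4 byte FCS
        vm = STLScVmRaw([
            STLVmFlowVar(name="src", min_value="16.0.0.1", max_value="16.0.0.254",
                         size=4, op="inc"),
            STLVmWrFlowVar(fv_name="src", pkt_offset="IP.src"),
            STLVmFixIpv4(offset="IP"),       # recompute the IPv4 checksum after rewriting
        ])
        return [STLStream(packet=STLPktBuilder(pkt=base / pad, vm=vm),
                          mode=STLTXCont(percentage=100))]

def register():
    return R86SProfile()
```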

### Intel i226-V (2.5GbE)

This is a 2.5G network interface from the _Foxville_ family. Released in Q2 2022 with a ten year
expected availability, it's currently a very good choice. It is a consumer/client chip, which means
I cannot expect super performance from it. In this machine, the three RJ45 ports are connected to
PCI slots 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (this means they are PCIe v2.0), and each takes
one x1 PCIe lane to the CPU. I leave the first port as management, and take the second and third ones
and give them to VPP like so:

```
dpdk {
  dev 0000:02:00.0 { name e0 }
  dev 0000:03:00.0 { name e1 }
  no-multi-seg
  decimal-interface-names
  uio-driver vfio-pci
}
```

The logical configuration then becomes:

```
set int state e0 up
set int state e1 up
set int ip address e0 100.64.1.1/30
set int ip address e1 100.64.2.1/30
ip route add 16.0.0.0/24 via 100.64.1.2
ip route add 48.0.0.0/24 via 100.64.2.2
ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70
ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71

mpls table add 0
set interface mpls e0 enable
set interface mpls e1 enable
mpls local-label add 16 eos via 100.64.2.2 e1
mpls local-label add 17 eos via 100.64.1.2 e0
```

In the first block, I'll bring up interfaces `e0` and `e1`, give them an IPv4 address in a /30
transit net, and set a route to the other side. I'll route packets destined for 16.0.0.0/24 to the
Cisco T-Rex loadtester at 100.64.1.2, and I'll route packets for 48.0.0.0/24 to the T-Rex at
100.64.2.2. To avoid the need to ARP for T-Rex, I'll set some static ARP entries pointing at the
loadtester's MAC addresses.

In the second block, I'll create the default MPLS table, enable MPLS on the two interfaces, and add
two FIB entries. If VPP receives an MPLS packet with label 16, it'll forward it on to Cisco T-Rex on
port `e1`, and if it receives a packet with label 17, it'll forward it to T-Rex on port `e0`.
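
Before starting the loadtest, a quick sanity check of the forwarding state in the VPP CLI doesn't
hurt; output omitted here, but these commands dump the IPv4 and MPLS FIB entries that the
configuration above creates:

```
vpp# show ip fib 16.0.0.0/24
vpp# show ip fib 48.0.0.0/24
vpp# show mpls fib
```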

Without further ado, here are the results of the i226-V loadtest:

| ***Intel i226-V***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| Unidirectional 1514b | 2.44Gbps | 202kpps | 99.4% |
| Unidirectional 64b Multi | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b Single | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b MPLS | 1.57Gbps | 3.27Mpps | 87.9% |
| Bidirectional 1514b | 4.84Gbps | 404kpps | 99.4% |
| Bidirectional 64b Multi | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b Single | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b MPLS | 2.43Gbps | 5.07Mpps | 68.2% |

First response: very respectable!

#### Important Notes

**1. L1 vs L2** \
There are a few observations I want to make, as these numbers can be confusing. First off, when
given large packets, VPP can easily sustain almost exactly (!) the line rate of 2.5GbE. There's always
a debate about these numbers, so let me offer some theoretical background --

1. The L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6
   bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). T-Rex shows us this
   number as `Tx bps L2`.
1. But on the wire, the PHY has to additionally send a _preamble_ (7 bytes), a _start frame
   delimiter_ (1 byte), and at the end, an _interpacket gap_ (12 bytes), which is 20 bytes of
   overhead. This means that the total size on the wire will be **1534 bytes**. T-Rex shows us this
   number as `Tx bps L1`.
1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we
   can send at most 2'500'000'000 / 12272 = **203715 packets per second**. Regardless of L1 or L2,
   this number is always `Tx pps`.
1. The smallest (L2) Ethernet frame we're allowed to send is 64 bytes, and anything shorter than
   this is called a _Runt_. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this
   means **3.72Mpps** is the theoretical maximum.

When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it
only shows us the `Rx bps`, which is the **L2 bits/sec** and corresponds to the sending port's `Tx
bps L2`. When I describe the percentage of Line-Rate, I calculate this with what physically fits on
the wire, i.e. the **L1 bits/sec**, because that makes most sense to me.

When sending small 64b packets, the difference is significant: taking the above _Unidirectional 64b
Single_ as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M\*64\*8 = 1.679Gbit
of L2 traffic, but a bandwidth of 3.28M\*(64+20)\*8 = 2.204Gbit of L1 traffic, which is how I
determine that it is 88.1% of Line-Rate.
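
If you want to play with these numbers yourself, the arithmetic is easy to capture in a few lines of
Python; a small sketch of the calculations above, purely illustrative and not part of the loadtest
tooling:

```
# Ethernet L1 overhead: preamble (7) + SFD (1) + interpacket gap (12) = 20 bytes,
# on top of the L2 frame, which already includes the 4 byte FCS.
L1_OVERHEAD = 20

def max_pps(line_rate_bps, l2_frame_bytes):
    """Theoretical packets/sec that physically fit on the wire."""
    return line_rate_bps / ((l2_frame_bytes + L1_OVERHEAD) * 8)

def pct_of_line_rate(observed_pps, line_rate_bps, l2_frame_bytes):
    """Observed load as a percentage of what fits on the wire (L1)."""
    return 100.0 * observed_pps * (l2_frame_bytes + L1_OVERHEAD) * 8 / line_rate_bps

print(max_pps(2.5e9, 1514))                  # ~203'715 pps for 1514b frames on 2.5GbE
print(max_pps(2.5e9, 64))                    # ~3.72 Mpps for 64b frames on 2.5GbE
print(pct_of_line_rate(3.28e6, 2.5e9, 64))   # ~88% -- the Unidirectional 64b result
```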

**2. One RX queue** \
A less pedantic observation is that there is no difference between _Multi_ and _Single_ flow
loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread.
I did do a few loadtests with multiple receive queues, but it does not matter for performance. When
offering this 3.28Mpps of load, I can see that VPP itself is not saturated. Most of
the time it's just sitting there waiting for DPDK to give it work, which manifests as a relatively
low vectors/call:

```
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87
  vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0
           Name             State      Calls     Vectors   Suspends   Clocks   Vectors/Call
dpdk-input                 polling     61933     2846586      0       1.28e2      45.96
ethernet-input             active      61733     2846586      0       1.71e2      46.11
ip4-input-no-checksum      active      61733     2846586      0       6.54e1      46.11
ip4-load-balance           active      61733     2846586      0       4.70e1      46.11
ip4-lookup                 active      61733     2846586      0       7.50e1      46.11
ip4-rewrite                active      61733     2846586      0       7.23e1      46.11
e1-output                  active      61733     2846586      0       2.53e1      46.11
e1-tx                      active      61733     2846586      0       1.38e2      46.11
```

By the way, the other numbers here are fascinating as well. Take a look at them:
* **Calls**: How often has VPP executed this graph node.
* **Vectors**: How many packets (which are internally called vectors) have been handled.
* **Vectors/Call**: Every time VPP executes the graph node, on average how many packets are done
  at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00.
* **Clocks**: How many CPU cycles, on average, did each packet spend in each graph node.
  Interestingly, summing up this number gets very close to the total CPU clock cycles available
  (on this machine 2.4GHz).

Zooming in on the **clocks** number a bit more: every time a packet was handled, roughly 594 CPU
cycles were spent in VPP's directed graph. An additional 128 CPU cycles were spent asking DPDK for
work. Summing it all up, 3.28M\*(594+128) = 2'369'170'800, which is eerily close to the 2.4GHz I
mentioned above. I love it when the math checks out!!
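
The same back-of-the-envelope check as a quick sketch; again just illustrative arithmetic, not
something from the loadtest tooling:

```
# CPU cycle budget check for one VPP worker on the i226-V test.
pps       = 3.2814e6   # packets/sec handled by this worker (from 'show runtime')
graph_cyc = 594        # cycles/packet summed over the non-polling graph nodes
dpdk_cyc  = 128        # cycles/packet spent in dpdk-input
cpu_hz    = 2.4e9      # roughly the clock this core was running at

used = pps * (graph_cyc + dpdk_cyc)
print(f"{used:.3e} cycles/sec used, {100 * used / cpu_hz:.1f}% of the core")
# -> roughly 2.37e9 cycles/sec, i.e. nearly all of the ~2.4GHz core
```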

By the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent
in `dpdk-input` (and other polling nodes like `unix-epoll-input`) just go up to consume the whole
core. I explain that in a bit more detail below.

**3. Uni- vs Bidirectional** \
I noticed a non-linear response between loadtests in one direction versus both directions. At large
packets, it did not matter. Both directions saturated the line nearly perfectly (202kpps in one
direction, and 404kpps in both directions). However, with the smaller packets, some contention became
clear. In only one direction, IPv4 and MPLS forwarding were roughly 3.28Mpps; but in both
directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it's
interesting to see how these i226-V chips do seem to care whether they are only receiving or only
transmitting, or performing both receiving *and* transmitting at the same time.

### Intel X520 (10GbE)

{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device
(5.0GT/s) and to be able to run two ports, it will need eight lanes of PCI connectivity. However, a
quick inspection using `dmesg` shows me that there are only 4 lanes brought to the OCP connector:

```
ixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link
at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link)
ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000
ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38
ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
```

That's a bummer! There are two TenGig ports on this OCP card, but the chip is a PCIe v2.0 device,
which means the encoding on the wire is 8b/10b, so each lane can deliver only about 80% of its
5.0GT/s; 80% of the 4x5.0GT/s = 20GT/s is 16.0Gbit. By the way, when PCIe v3.0 was released, not only
did the transfer speed go to 8.0GT/s per lane, the encoding also changed to 128b/130b, which lowers the
overhead from a whopping 20% to only 1.5%. It's not a bad investment of time to read up on PCI
Express standards on [[Wikipedia](https://en.wikipedia.org/wiki/PCI_Express)], as PCIe limitations
and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my
buddy Sander also noted during my NLNOG talk last year.
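
The encoding overhead is easy to put into numbers; a quick sketch of the usable PCIe bandwidth for
the configurations that matter in this article:

```
# Usable PCIe bandwidth = transfer rate per lane * lanes * encoding efficiency.
def pcie_gbps(gt_per_s, lanes, payload_bits, total_bits):
    return gt_per_s * lanes * payload_bits / total_bits

print(pcie_gbps(5.0, 4, 8, 10))      # PCIe v2.0 x4: 16.0 Gb/s  (what the OCP slot delivers)
print(pcie_gbps(5.0, 8, 8, 10))      # PCIe v2.0 x8: 32.0 Gb/s  (what the X520 would like)
print(pcie_gbps(8.0, 4, 128, 130))   # PCIe v3.0 x4: ~31.5 Gb/s (what the Cx5 gets later on)
```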

#### Intel X520: Loadtest Results

Now that I've shown a few of these runtime statistics, I think it's good to review three pertinent graphs.
I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2
adapter. I'll run the same eight loadtests:
**{1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}**

In the output above, I showed `show runtime` from the VPP debug CLI. These numbers are
also exported by a Prometheus exporter; I wrote about that in this [[article]({{< ref "2023-04-09-vpp-stats" >}})]. In Grafana, I can draw these timeseries as graphs, and that shows me a lot
about where VPP is spending its time. Each _node_ in the directed graph counts how many vectors
(packets) it has seen, and how many CPU cycles it has spent doing its work.

{{< image src="/assets/r86s/grafana-vectors.png" alt="Grafana Vectors" >}}

In VPP, a graph of _vectors/sec_ shows how many packets per second the router is forwarding. The
graph above is on a logarithmic scale, and I've annotated each of the eight loadtests in orange. The
first block of four are the ***U***_nidirectional_ tests and of course, higher values are better.

I notice that some of these loadtests ramp up until a certain point, after which they become a <span
style='color:orange;font-weight:bold;'>flatline</span>, which I drew orange arrows for. The first
time this clearly happens is in the ***U3*** loadtest. It makes sense to me, because having one flow
implies only one worker thread, whereas in the ***U2*** loadtest the system can make use of multiple
receive queues and therefore multiple worker threads. It stands to reason that ***U2*** has a
slightly better performance than ***U3***.

The fourth test, the _MPLS_ loadtest, is forwarding identical packets, received with label 16 and sent
out on another interface with label 17. They are therefore also single flow, and this explains why the
***U4*** loadtest looks very similar to the ***U3*** one. Some NICs can hash MPLS traffic to
multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka
82599ES cannot do that.

The second block of four are the ***B***_idirectional_ tests. Similar to the tests I did with the
i226-V 2.5GbE NICs, here each of the network cards has to both receive traffic as well as send
traffic. It is with this graph that I can determine the overall throughput in packets/sec
of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex
loadtester output JSON. Here they are, for the Intel X520-DA2:

| ***Intel 82599ES***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.77Gbps | 809kpps | 99.2% |
| U2: Unidirectional 64b Multi | 6.48Gbps | 13.4Mpps | 90.1% |
| U3: Unidirectional 64b Single | 3.73Gbps | 7.77Mpps | 52.2% |
| U4: Unidirectional 64b MPLS | 3.32Gbps | 6.91Mpps | 46.4% |
| B1: Bidirectional 1514b | 12.9Gbps | 1.07Mpps | 65.6% |
| B2: Bidirectional 64b Multi | 6.08Gbps | 12.7Mpps | 42.7% |
| B3: Bidirectional 64b Single | 6.25Gbps | 13.0Mpps | 43.7% |
| B4: Bidirectional 64b MPLS | 3.26Gbps | 6.79Mpps | 22.8% |

A few further observations:
1. ***U1***'s loadtest shows that the machine can sustain 10Gbps in one direction, while ***B1***
   shows that bidirectional loadtests are not yielding twice as much throughput. This is very
   likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the
   OCP NIC supports PCIe 5.0GT/s x8 (32Gbps).
1. ***U3***'s loadtest shows that one single CPU can do 7.77Mpps max, if it's the only CPU that is
   doing work. This is likely because a lone worker thread gets to use the
   entire L2/L3 cache for itself.
1. ***U2***'s test shows that when multiple workers perform work, the throughput rises to
   13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is
   because the threads now need to share the CPU's modest L2/L3 cache.
1. ***B3***'s loadtest shows that two CPU threads together can do 6.50Mpps each (for a total of
   13.0Mpps), which I think is likely because each NIC now has to receive _and_ transmit packets.

If you're reading this and think you have an alternative explanation, do let me know!

### Mellanox Cx3 (10GbE)

When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA)
for a _list_ of packets, rather than one individual packet. It then brings these packets, called
_vectors_, through a directed acyclic graph inside of VPP. Each graph node does something specific to
the packets, for example in `ethernet-input`, the node checks what ethernet type each packet is (ARP,
IPv4, IPv6, MPLS, ...), and hands them off to the correct next node, such as `ip4-input` or `mpls-input`.
If VPP is idle, there may be only one or two packets in the list, which means every time the packets
go into a new node, a new chunk of code has to be loaded from working memory into the CPU's
instruction cache. Conversely, if there are many packets in the list, only the first packet may need
to pull things into the i-cache, and the second through Nth packets will be cache hits and execute
_much_ faster. Moreover, some nodes in VPP make use of processor optimizations like _SIMD_ (single
instruction, multiple data), to save on clock cycles if the same operation needs to be executed
multiple times.

{{< image src="/assets/r86s/grafana-clocks.png" alt="Grafana Clocks" >}}

This graph shows the average CPU cycles per packet for each node. In the first three loadtests
(***U1***, ***U2*** and ***U3***), you can see four lines representing the VPP nodes `ip4-input`,
`ip4-lookup`, `ip4-load-balance` and `ip4-rewrite`. In the fourth loadtest ***U4***, you can see
only three nodes: `mpls-input`, `mpls-lookup`, and `ip4-mpls-label-disposition-pipe` (where the MPLS
label '16' is swapped for outgoing label '17').

It's clear to me that when VPP does not have many packets/sec to route (i.e. the ***U1*** loadtest),
the cost _per packet_ is actually quite high at around 200 CPU cycles per packet per node. But, if I
slam the VPP instance with lots of packets/sec (i.e. the ***U3*** loadtest), VPP gets _much_ more
efficient at what it does. What used to take 200+ cycles per packet now only takes between 34-52
cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?!

And with that, the Mellanox Cx3 loadtest completes, and the results are in:

| ***Mellanox MCX342A-XCCN***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.73Gbps | 805kpps | 99.7% |
| U2: Unidirectional 64b Multi | 1.11Gbps | 2.30Mpps | 15.5% |
| U3: Unidirectional 64b Single | 1.10Gbps | 2.27Mpps | 15.3% |
| U4: Unidirectional 64b MPLS | 1.10Gbps | 2.27Mpps | 15.3% |
| B1: Bidirectional 1514b | 18.7Gbps | 1.53Mpps | 94.9% |
| B2: Bidirectional 64b Multi | 1.54Gbps | 2.29Mpps | 7.69% |
| B3: Bidirectional 64b Single | 1.54Gbps | 2.29Mpps | 7.69% |
| B4: Bidirectional 64b MPLS | 1.54Gbps | 2.29Mpps | 7.69% |

Here's something that I find strange though. VPP is clearly not saturated by these 64b loadtests. I
know this, because in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a
bidirectional test, yet with this Mellanox Cx3 card, no matter if I do one direction or both
directions, the max packets/sec tops out at 2.3Mpps only -- that's nearly an order of magnitude lower.

Looking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from
Port 5/0/1) are not very busy at all. If a VPP worker thread is saturated, this typically shows as
a vectors/call of 256.00 and 100% of CPU cycles consumed. But here, that's not the case at all, and
most time is spent in DPDK waiting for traffic:

```
Thread 1 vpp_wk_0 (lcore 1)
Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15
  vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0
           Name                  State       Calls      Vectors   Suspends   Clocks   Vectors/Call
FiftySixGigabitEthernet5/0/1-o  active     15949560    35929200      0       8.39e1      2.25
FiftySixGigabitEthernet5/0/1-t  active     15949560    35929200      0       2.59e2      2.25
dpdk-input                      polling    36250611    35929200      0       6.55e2       .99
ethernet-input                  active     15949560    35929200      0       2.69e2      2.25
ip4-input-no-checksum           active     15949560    35929200      0       1.01e2      2.25
ip4-load-balance                active     15949560    35929200      0       7.64e1      2.25
ip4-lookup                      active     15949560    35929200      0       9.26e1      2.25
ip4-rewrite                     active     15949560    35929200      0       9.28e1      2.25
unix-epoll-input                polling       35367           0      0       1.29e3      0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38
  vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0
           Name                  State       Calls      Vectors   Suspends   Clocks   Vectors/Call
FiftySixGigabitEthernet5/0/0-o  active     14845221    35913927      0       8.66e1      2.42
FiftySixGigabitEthernet5/0/0-t  active     14845221    35913927      0       2.72e2      2.42
dpdk-input                      polling    23114538    35913927      0       6.99e2      1.55
ethernet-input                  active     14845221    35913927      0       2.65e2      2.42
ip4-input-no-checksum           active     14845221    35913927      0       9.73e1      2.42
ip4-load-balance                active     14845220    35913923      0       7.17e1      2.42
ip4-lookup                      active     14845221    35913927      0       9.03e1      2.42
ip4-rewrite                     active     14845221    35913927      0       8.97e1      2.42
unix-epoll-input                polling       22551           0      0       1.37e3      0.00
```

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

I kind of wonder why that is. Is the Mellanox ConnectX-3 such a poor performer? Or does it not like
small packets? I've read online that Mellanox cards do some form of message compression on the PCI
bus, which is perhaps something to turn off. I don't know, but I don't like it!

### Mellanox Cx5 (25GbE)

VPP has a few _polling_ nodes, which are pieces of code that execute back-to-back in a tight
execution loop. A classic example of a _polling_ node is a _Poll Mode Driver_ from DPDK: this will
ask the network cards if they have any packets, and if so: marshall them through the directed graph
of VPP. As soon as that's done, the node will immediately ask again. If there is no work to do, this
turns into a tight loop with DPDK continuously asking for work. There is however another, lesser
known, _polling_ node: `unix-epoll-input`. This node services a local pool of file descriptors, like
the _Linux Control Plane_ netlink socket for example, or the clients attached to the Statistics
segment, CLI or API. You can see the open files with `show unix files`.

This design explains why the CPU load of a typical DPDK application is 100% on each worker thread. As
an aside, you can ask the PMD to start off in _interrupt_ mode, and only after a certain load switch
seamlessly to _polling_ mode. Take a look at `set interface rx-mode` on how to change from _polling_
to _interrupt_ or _adaptive_ modes. For performance reasons, I always leave the node in _polling_
mode (the default in VPP).
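
For completeness, switching an interface between these modes is a one-liner in the VPP CLI; a small
sketch using the interface names from this article (I kept everything in the default _polling_ mode
for all loadtests):

```
vpp# set interface rx-mode e0 adaptive
vpp# set interface rx-mode e0 polling
vpp# show interface rx-placement
```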

{{< image src="/assets/r86s/grafana-cpu.png" alt="Grafana CPU" >}}

The stats segment shows how many clock cycles are being spent in each call of each node. It also
knows how often nodes are called. Considering the `unix-epoll-input` and `dpdk-input` nodes will
perform what is essentially a tight loop, the CPU should always add up to 100%. I found that one
cool way to show how busy a VPP instance really is, is to look over all CPU threads and sort
through the fraction of time spent in each node:

* ***Input Nodes***: are those which handle the receive path from DPDK and into the directed graph
  for routing -- for example `ethernet-input`, then `ip4-input` through to `ip4-lookup` and finally
  `ip4-rewrite`. This is where VPP usually spends most of its CPU cycles.
* ***Output Nodes***: are those which handle the transmit path into DPDK. You'll see these are
  nodes whose name ends in `-output` or `-tx`. You can also see that in ***U2***, there are only
  two such nodes consuming CPU, while in ***B2*** there are four nodes (because two interfaces are
  transmitting!)
* ***epoll***: the _polling_ node called `unix-epoll-input` depicted in <span
  style='color:brown;font-weight:bold'>brown</span> in this graph.
* ***dpdk***: the _polling_ node called `dpdk-input` depicted in <span
  style='color:green;font-weight:bold'>green</span> in this graph.

If there is no work to do, as was the case at around 20:30 in the graph above, the _dpdk_ and
_epoll_ nodes are the only two that are consuming CPU. If there's lots of work to do, as was the
case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest
between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving
the _dpdk_ and _epoll_ nodes until an equilibrium is achieved. This is how I know the VPP process
is the bottleneck and not, for example, the PCI bus.

I let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table
for the Mellanox Cx5:

| ***Mellanox MCX542_ACAT***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 24.2Gbps | 2.01Mpps | 98.6% |
| U2: Unidirectional 64b Multi | 7.43Gbps | 15.5Mpps | 41.6% |
| U3: Unidirectional 64b Single | 3.52Gbps | 7.34Mpps | 19.7% |
| U4: Unidirectional 64b MPLS | 7.34Gbps | 15.3Mpps | 46.4% |
| B1: Bidirectional 1514b | 24.9Gbps | 2.06Mpps | 50.4% |
| B2: Bidirectional 64b Multi | 6.58Gbps | 13.7Mpps | 18.4% |
| B3: Bidirectional 64b Single | 3.15Gbps | 6.55Mpps | 8.81% |
| B4: Bidirectional 64b MPLS | 6.55Gbps | 13.6Mpps | 18.3% |

Some observations:
1. This Mellanox Cx5 runs quite a bit hotter than the other two cards. It's a PCIe v3.0 device, which
   means that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case
   you're wondering, this is 128b/130b encoding on 8.0GT/s x4).
1. It easily saturates 25Gbit in one direction with big packets in ***U1***, but as soon as smaller
   packets are offered, each worker thread tops out at 7.34Mpps or so in ***U2***.
1. When testing in both directions, each thread can do about 6.55Mpps or so in ***B2***. Similar to
   the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple
   threads), and RX/TX simultaneously (when doing bidirectional tests).
1. MPLS is a lot faster -- nearly double, thanks to the use of multiple threads. I think this is
   because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner
   payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not.

## Summary and closing thoughts

There's a lot to say about these OCP cards, and it's pretty difficult to make a solid
recommendation. The Intel is cheap, the Mellanox Cx3 is a bit quirky with its VPP enumeration, and
the Mellanox Cx5 is a bit more expensive (and draws a fair bit more power, coming in at 20W all-up),
but it does do 25Gbit reasonably well. What I find interesting is the very low limit in packets/sec
on 64b packets coming from the Cx3, while at the same time there seems to be an added benefit in
MPLS hashing that the other two cards do not have.

All things considered, I think I would recommend the Intel X520-DA2 (based on the _Niantic_ chip,
Intel 82599ES, total machine coming in at 17W). It seems like it pairs best with the available CPU
on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)

Here's a few files I gathered along the way, in case they are useful:

* [[LSCPU](/assets/r86s/lscpu.txt)] - [[Likwid Topology](/assets/r86s/likwid-topology.txt)] -
  [[DMI Decode](/assets/r86s/dmidecode.txt)] - [[LSBLK](/assets/r86s/lsblk.txt)]
* Mellanox Cx341: [[dmesg](/assets/r86s/dmesg-cx3.txt)] - [[LSPCI](/assets/r86s/lspci-cx3.txt)] -
  [[LSHW](/assets/r86s/lshw-cx3.txt)] - [[VPP Patch](/assets/r86s/vpp-cx3.patch)]
* Mellanox Cx542: [[dmesg](/assets/r86s/dmesg-cx5.txt)] - [[LSPCI](/assets/r86s/lspci-cx5.txt)] -
  [[LSHW](/assets/r86s/lshw-cx5.txt)]
* Intel X520-DA2: [[dmesg](/assets/r86s/dmesg-x520.txt)] - [[LSPCI](/assets/r86s/lspci-x520.txt)] -
  [[LSHW](/assets/r86s/lshw-x520.txt)]
* VPP Configs: [[startup.conf](/assets/r86s/vpp/startup.conf)] - [[L2 Config](/assets/r86s/vpp/config/l2.vpp)] -
  [[L3 Config](/assets/r86s/vpp/config/l3.vpp)] - [[MPLS Config](/assets/r86s/vpp/config/mpls.vpp)]