---
date: "2024-07-05T12:51:23Z"
title: 'Review: R86S (Jasper Lake - N6005)'
---

# Introduction

{{< image width="250px" float="right" src="/assets/r86s/r86s-front.png" alt="R86S Front" >}}

I am always interested in finding new hardware that is capable of running VPP. Of course, a standard
issue 19" rack mountable machine from Dell, HPE or SuperMicro is an obvious choice. They
come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA
or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or
size/cost of these 19" rack mountable machines can be prohibitive. Sometimes, just having a smaller
form factor can be very useful: \
***Enter the GoWin R86S!***

{{< image width="250px" float="right" src="/assets/r86s/r86s-nvme.png" alt="R86S NVME" >}}

I stumbled across this lesser-known build from GoWin, which is an ultra compact but modern design,
featuring three 2.5GbE ethernet ports and optionally two 10GbE, or as I'll show here, two 25GbE
ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot
off of an m.2 NVME -- which makes it immediately an appealing device to put in the field. I also
noticed that the height of the machine is just a few millimeters short of 1U, which is 1.75"
(44.5mm). That gives me the bright idea to 3D print a bracket so I can rack these, and because
they are very compact -- a width of only 78mm -- I can manage to fit four of them in one 1U front,
maybe even with a Mikrotik CRS305 breakout switch alongside. Slick!

{{< image width="250px" float="right" src="/assets/r86s/r86s-ocp.png" alt="R86S OCP" >}}

I picked up two of these _R86S Pro_ and when they arrived, I noticed that their 10GbE is actually
an _Open Compute Project_ (OCP) footprint expansion card, which struck me as clever. It means that I
can replace the Mellanox `CX342A` network card with perhaps something more modern, such as an Intel
`X520-DA2` or a Mellanox `MCX542B_ACAN`, which is even dual-25G! So I take to eBay and buy myself a few
OCP expansion boards, which are surprisingly cheap, perhaps because the OCP form factor isn't as
popular as 'normal' PCIe v3.0 cards.

I put a Google photos album online [[here](https://photos.app.goo.gl/gPMAp21FcXFiuNaH7)], in case
you'd like some more detailed shots.

In this article, I'll write about a mixture of hardware, systems engineering (how hardware like the
network cards, motherboard and CPU interact with one another), and VPP performance diagnostics.
I hope that it helps a few wary Internet denizens feel their way around these challenging but
otherwise fascinating technical topics. Ready? Let's go!

# Hardware Specs

{{< image width="250px" float="right" src="/assets/r86s/nics.png" alt="NICs" >}}

For the CHF 314,- I paid for each of these Intel Pentium N6005 machines, they are delightful! They feature:

* Intel Pentium Silver N6005 @ 2.00GHz (4 cores)
* 2x16GB Micron LPDDR4 memory @2933MT/s
* 1x Samsung SSD 980 PRO 1TB NVME
* 3x Intel I226-V 2.5GbE network ports
* 1x OCP v2.0 connector with PCIe v3.0 x4 delivered
* USB-C power supply
* 2x USB3 (one on front, one on side)
* 1x USB2 (on the side)
* 1x MicroSD slot
* 1x MicroHDMI video out
* Wi-Fi 6 AX201 160MHz onboard

To the right I've put the three OCP network interface cards side by side. On the top, the Mellanox
Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at
the bottom, the _classic_ Intel 82599ES (2x10G) card. As I'll demonstrate, despite having the same
form factor, each of these has a unique story to tell, well beyond their rated port speed.

There's quite a few options for CPU out there - GoWin sells them with Jasper Lake (Celeron N5105
or Pentium N6005, the one I bought), but also with the newer Alder Lake (N100 or N305). Price,
performance and power draw will vary. I looked at a few differences in Passmark, and I think I made
a good trade-off between cost, power and performance. You may of course choose differently!

The R86S form factor is very compact, coming in at 80mm x 120mm x 40mm, and the case is made of
sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the
kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service
the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two
antennae at the back are tri-band, suitable for WiFi 6. There is one fan included in the chassis,
with a few cut-outs in the top of the case to let the air flow through. The fan is not
noisy, but definitely noticeable.

## Compiling VPP on R86S

I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other
with the Mellanox Cx5 network cards. While the Mellanox Cx342A that comes with the R86S does have
DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both
ports as unique PCI devices, causing VPP to crash with duplicate graph node names:

```
vlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx'
Failed to save post-mortem API trace to /tmp/api_post_mortem.794
received signal SIGABRT, PC 0x7f9445aa9e2c
```

The way VPP enumerates DPDK devices is by walking the PCI bus, but considering the Connect-X3 has
two ports behind the same PCI address, it'll try to create two interfaces with the same name, which
fails. It's pretty easily fixable with a small [[patch](/assets/r86s/vpp-cx3.patch)]. Off I go to
compile VPP (version `24.10-rc0~88-ge3469369dd`) with Mellanox DPDK support, to get the best
side-by-side comparison: the Cx3 and X520 cards need DPDK, while the Cx5 card could optionally also
use VPP's RDMA driver. They will _all_ be using DPDK in my tests.
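
For reference, a build like this is fairly unremarkable once the Mellanox userspace libraries are
present, so that the bundled DPDK can build its MLX poll mode drivers. A minimal sketch, assuming
stock Debian packages and VPP's standard Makefile targets (your VPP version may want different
flags):

```
# Userspace verbs libraries, so DPDK builds its mlx4/mlx5 PMDs
sudo apt install build-essential git libibverbs-dev rdma-core

git clone https://gerrit.fd.io/r/vpp
cd vpp
make install-dep      # VPP's own build dependencies
make build-release    # builds VPP together with its bundled DPDK
make pkg-deb          # emits .deb packages to install on the R86S
```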

I'm not out of the woods yet, because VPP throws an error when enumerating and attaching the
Mellanox Cx342. I read the DPDK documentation for this poll mode driver
[[ref](https://doc.dpdk.org/guides/nics/mlx4.html)] and find that when using DPDK applications, the
`mlx4_core` driver in the kernel has to be initialized with a specific flag, like so:

```
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1"
```

And because I'm using `iommu`, the correct driver to load for the Cx3 is `vfio_pci`, so I put that in
`/etc/modules`, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the
way, I am now ready to take the R86S out for a spin and see how much this little machine is capable
of forwarding as a router.
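
For those following along at home, the Debian side of that boils down to roughly these steps (a
sketch, using the stock Debian tools):

```
sudo vi /etc/default/grub    # add the mlx4_core / iommu flags to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub             # regenerate the GRUB config

echo vfio_pci | sudo tee -a /etc/modules
sudo update-initramfs -u     # rebuild the initrd so vfio_pci is loaded at boot
sudo reboot
```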

### Power: Idle and Under Load

I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by the OS and
controlplane, leaving 3 worker threads for VPP. The Pentium Silver N6005 comes with 32kB of L1
per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It's not much, but then again
the TDP is a shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the
entire machine idles at 12 Watts. Under full load, the machine sips 17 Watts with either the
Mellanox Cx3 or the Intel X520-DA2 fitted, and slurps 20 Watts all-up with the Mellanox Cx5. Neat!

## Loadtest Results

{{< image width="400px" float="right" src="/assets/r86s/loadtest.png" alt="Loadtest" >}}

For each network interface I will do a series of loadtests, to show different aspects of the setup.
First, I'll do a set of unidirectional tests, where traffic goes into one port and exits another.
I will do this with either large packets (1514b), small packets (64b) with many flows, which allows me
to use multiple hardware receive queues assigned to individual worker threads, or small packets with
only one flow, limiting VPP to only one RX queue and consequently only one CPU thread. Because I
think it's hella cool, I will also loadtest MPLS label switching (e.g. an MPLS frame with label '16' on
ingress, forwarded with a swapped label '17' on egress). In general, MPLS lookups can be a bit
faster as they are (constant time) hashtable lookups, while IPv4 longest prefix match lookups use a
trie. MPLS won't be significantly faster than IPv4 in these tests, though, because the FIB is tiny with
only a handful of entries.

Second, I'll do the same loadtests but in both directions, which means traffic is both entering NIC0
and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests,
again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip
has to do more work to maintain its RX queues *and* its TX queues simultaneously. As I'll
demonstrate, this tends to matter quite a bit on consumer hardware.

### Intel i226-V (2.5GbE)

This is a 2.5G network interface from the _Foxville_ family. Released in Q2 2022 with a ten year
expected availability, it's currently a very good choice. It is a consumer/client chip, which means
I cannot expect stellar performance from it. In this machine, the three RJ45 ports are connected at
PCI addresses 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (this means they are PCIe v2.0) and each
taking one x1 PCIe lane to the CPU. I leave the first port as management, and give the second and
third ones to VPP like so:

```
dpdk {
  dev 0000:02:00.0 { name e0 }
  dev 0000:03:00.0 { name e1 }
  no-multi-seg
  decimal-interface-names
  uio-driver vfio-pci
}
```

The logical configuration then becomes:

```
set int state e0 up
set int state e1 up
set int ip address e0 100.64.1.1/30
set int ip address e1 100.64.2.1/30
ip route add 16.0.0.0/24 via 100.64.1.2
ip route add 48.0.0.0/24 via 100.64.2.2
ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70
ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71

mpls table add 0
set interface mpls e0 enable
set interface mpls e1 enable
mpls local-label add 16 eos via 100.64.2.2 e1
mpls local-label add 17 eos via 100.64.1.2 e0
```

In the first block, I'll bring up interfaces `e0` and `e1`, give them an IPv4 address in a /30
transit net, and set a route to the other side. I'll route packets destined to 16.0.0.0/24 to the
Cisco T-Rex loadtester at 100.64.1.2, and I'll route packets for 48.0.0.0/24 to the T-Rex at
100.64.2.2. To avoid the need to ARP for T-Rex, I'll set static ARP entries for the loadtester's
MAC addresses.

In the second block, I'll create MPLS table 0, enable MPLS on the two interfaces, and add two FIB entries. If
VPP receives an MPLS packet with label 16, it'll forward it on to Cisco T-Rex on port `e1`, and if it
receives a packet with label 17, it'll forward it to T-Rex on port `e0`.

Without further ado, here are the results of the i226-V loadtest:

| ***Intel i226-V***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|------------------------------|-------------|-------------|----------------|
| Unidirectional 1514b         | 2.44Gbps    | 202kpps     | 99.4%          |
| Unidirectional 64b Multi     | 1.58Gbps    | 3.28Mpps    | 88.1%          |
| Unidirectional 64b Single    | 1.58Gbps    | 3.28Mpps    | 88.1%          |
| Unidirectional 64b MPLS      | 1.57Gbps    | 3.27Mpps    | 87.9%          |
| Bidirectional 1514b          | 4.84Gbps    | 404kpps     | 99.4%          |
| Bidirectional 64b Multi      | 2.44Gbps    | 5.07Mpps    | 68.2%          |
| Bidirectional 64b Single     | 2.44Gbps    | 5.07Mpps    | 68.2%          |
| Bidirectional 64b MPLS       | 2.43Gbps    | 5.07Mpps    | 68.2%          |

First response: very respectable!

#### Important Notes

**1. L1 vs L2** \
There are a few observations I want to make, as these numbers can be confusing. First off, when
given large packets, VPP can easily sustain almost exactly (!) the line rate of 2.5GbE. There's always a
debate about these numbers, so let me offer some theoretical background --

1. The L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6
   bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). T-Rex shows us this
   number as `Tx bps L2`.
1. But on the wire, the PHY has to additionally send a _preamble_ (7 bytes), a _start frame
   delimiter_ (1 byte), and at the end, an _interpacket gap_ (12 bytes), which is 20 bytes of
   overhead. This means that the total size on the wire will be **1534 bytes**. T-Rex shows us this
   number as `Tx bps L1`.
1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we
   can send at most 2'500'000'000 / 12272 = **203715 packets per second**. Regardless of L1 or L2,
   this number is always `Tx pps`.
1. The smallest (L2) Ethernet frame we're allowed to send is 64 bytes, and anything shorter than
   this is called a _Runt_. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this
   means **3.72Mpps** is the theoretical maximum.

When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it
only shows us the `Rx bps`, which is the **L2 bits/sec** corresponding to the sending port's `Tx
bps L2`. When I describe the percentage of Line-Rate, I calculate this with what physically fits on
the wire, that is the **L1 bits/sec**, because that makes the most sense to me.

When sending small 64b packets, the difference is significant: taking the above _Unidirectional 64b
Single_ as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M\*64\*8 = 1.679Gbit
of L2 traffic, but a bandwidth of 3.28M\*(64+20)\*8 = 2.204Gbit of L1 traffic, which is how I
determine that it is 88.1% of Line-Rate.
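
If you'd like to play with these numbers yourself, the arithmetic is easy to reproduce in a shell
with `bc`; the figures below are simply the worked examples from this section:

```
# Line-rate in packets/sec: line rate divided by (L2 frame + 20 bytes of L1 overhead) in bits
echo '2500000000 / ((1514 + 20) * 8)' | bc          # 203715 pps for 1514b frames
echo '2500000000 / ((64 + 20) * 8)'   | bc          # 3720238 pps, i.e. 3.72Mpps for 64b frames

# L2 versus L1 bandwidth at the observed 3.28Mpps of 64b packets
echo '3280000 * 64 * 8'        | bc                 # 1679360000, ~1.679 Gbit/s of L2
echo '3280000 * (64 + 20) * 8' | bc                 # 2204160000, ~2.204 Gbit/s of L1
echo 'scale=3; 2204160000 / 2500000000 * 100' | bc  # ~88.1% of Line-Rate
```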

**2. One RX queue** \
A less pedantic observation is that there is no difference between _Multi_ and _Single_ flow
loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread.
I did do a few loadtests with multiple receive queues, but it does not matter for performance. When
handling this 3.28Mpps of load, I can see that VPP itself is not saturated. I can see that most of
the time it's just sitting there waiting for DPDK to give it work, which manifests as a relatively
low vectors/call:

```
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87
  vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends       Clocks   Vectors/Call
dpdk-input                       polling        61933       2846586            0       1.28e2          45.96
ethernet-input                    active        61733       2846586            0       1.71e2          46.11
ip4-input-no-checksum             active        61733       2846586            0       6.54e1          46.11
ip4-load-balance                  active        61733       2846586            0       4.70e1          46.11
ip4-lookup                        active        61733       2846586            0       7.50e1          46.11
ip4-rewrite                       active        61733       2846586            0       7.23e1          46.11
e1-output                         active        61733       2846586            0       2.53e1          46.11
e1-tx                             active        61733       2846586            0       1.38e2          46.11
```

By the way, the other numbers here are fascinating as well. Take a look at them:
* **Calls**: How often has VPP executed this graph node.
* **Vectors**: How many packets (which are internally called vectors) have been handled.
* **Vectors/Call**: Every time VPP executes the graph node, on average how many packets are done
  at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00.
* **Clocks**: How many CPU cycles, on average, did each packet spend in each graph node.
  Interestingly, summing up this number gets very close to the total CPU clock cycles available
  (on this machine 2.4GHz).

Zooming in on the **clocks** number a bit more: every time a packet was handled, roughly 594 CPU
cycles were spent in VPP's directed graph. An additional 128 CPU cycles were spent asking DPDK for
work. Summing it all up, 3.28M\*(594+128) = 2'369'170'800, which is eerily close to the 2.4GHz I
mentioned above. I love it when the math checks out!!
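
The same back-of-the-envelope check, using the per-node clocks from the `show runtime` output above:

```
# Cycles spent per packet inside the graph nodes (ethernet-input .. e1-tx):
echo '171 + 65.4 + 47.0 + 75.0 + 72.3 + 25.3 + 138' | bc   # 594.0 cycles/packet
# Add the 128 cycles/packet spent in dpdk-input, multiply by the observed packet rate:
echo '(594 + 128) * 3281400' | bc                          # 2369170800, right around that 2.4GHz
```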

By the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent
in `dpdk-input` (and other polling nodes like `unix-epoll-input`) just go up to consume the whole
core. I explain that in a bit more detail below.

**3. Uni- vs Bidirectional** \
I noticed a non-linear response between loadtests in one direction versus both directions. At large
packets, it did not matter. Both directions saturated the line nearly perfectly (202kpps in one
direction, and 404kpps in both directions). However, with the smaller packets, some contention became
clear. In only one direction, IPv4 and MPLS forwarding were roughly 3.28Mpps; but in both
directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it's
interesting to see that these i226-V chips do seem to care whether they are only receiving, only
transmitting, or performing both receiving *and* transmitting.

### Intel X520 (10GbE)

{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}

This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside: it's a PCIe v2.0 device
(5.0GT/s) and to be able to run two ports, it will need eight lanes of PCI connectivity. However, a
quick inspection using `dmesg` shows me that there are only 4 lanes brought to the OCP connector:

```
ixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link
  at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link)
ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000
ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38
ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
```

That's a bummer! There are two TenGig ports on this OCP card, but the chip is a PCIe v2.0 device,
which means the PCI encoding is 8b/10b: each lane can deliver about 80% of its 5.0GT/s, and 80% of
the 20GT/s across four lanes is 16.0Gbit. By the way, when PCIe v3.0 was released, not only did the
transfer speed go to 8.0GT/s per lane, the encoding also changed to 128b/130b, which lowers the
overhead from a whopping 20% to only 1.5%. It's not a bad investment of time to read up on PCI
Express standards on [[Wikipedia](https://en.wikipedia.org/wiki/PCI_Express)], as PCIe limitations
and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my
buddy Sander also noted during my NLNOG talk last year.
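
Two quick ways to sanity-check this: the negotiated link is visible in `lspci`, and the usable
bandwidth follows directly from the transfer rate and encoding. A small sketch (the `05:00.0`
address is this machine's X520, the rest are just the generic PCIe numbers):

```
# Negotiated PCIe link speed/width for the X520 (LnkCap = capability, LnkSta = what was negotiated)
sudo lspci -vv -s 0000:05:00.0 | grep -E 'LnkCap|LnkSta'

# PCIe v2.0: 5.0GT/s per lane with 8b/10b encoding = 4 Gbit/s usable per lane
echo '5.0 * 8 / 10 * 4' | bc -l      # x4 link: 16 Gbit/s
echo '5.0 * 8 / 10 * 8' | bc -l      # x8 link: 32 Gbit/s

# PCIe v3.0: 8.0GT/s per lane with 128b/130b encoding
echo '8.0 * 128 / 130 * 4' | bc -l   # x4 link: ~31.5 Gbit/s
```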

#### Intel X520: Loadtest Results

Now that I've shown a few of these runtime statistics, I think it's good to review three pertinent graphs.
I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2
adapter. I'll run the same eight loadtests:
**{1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}**

In the output above, I showed the results of `show runtime` in the VPP debug CLI. These numbers are
also exported in a prometheus exporter. I wrote about that in this
[[article]({% post_url 2023-04-09-vpp-stats %})]. In Grafana, I can draw these timeseries as graphs,
and they show me a lot about where VPP is spending its time. Each _node_ in the directed graph
counts how many vectors (packets) it has seen, and how many CPU cycles it has spent doing its work.

{{< image src="/assets/r86s/grafana-vectors.png" alt="Grafana Vectors" >}}

In VPP, a graph of _vectors/sec_ shows how many packets per second the router is forwarding. The
graph above is on a logarithmic scale, and I've annotated each of the eight loadtests in orange. The
first block of four are the ***U***_nidirectional_ tests and of course, higher values are better.

I notice that some of these loadtests ramp up until a certain point, after which they become a <span
style='color:orange;font-weight:bold;'>flatline</span>, which I drew orange arrows for. The first
time this clearly happens is in the ***U3*** loadtest. It makes sense to me, because having one flow
implies only one worker thread, whereas in the ***U2*** loadtest the system can make use of multiple
receive queues and therefore multiple worker threads. It stands to reason that ***U2*** has a
slightly better performance than ***U3***.

The fourth test, the _MPLS_ loadtest, is forwarding identical packets arriving with label 16, out on
another interface with label 17. They are therefore also single flow, and this explains why the
***U4*** loadtest looks very similar to the ***U3*** one. Some NICs can hash MPLS traffic to
multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka
82599ES cannot do that.

The second block of four are the ***B***_idirectional_ tests. Similar to the tests I did with the
i226-V 2.5GbE NICs, here each of the network cards has to both receive traffic as well as send
traffic. It is with this graph that I can determine the overall throughput in packets/sec
of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex
loadtester output JSON. Here they are, for the Intel X520-DA2:

| ***Intel 82599ES***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|-------------------------------|-------------|-------------|----------------|
| U1: Unidirectional 1514b      | 9.77Gbps    | 809kpps     | 99.2%          |
| U2: Unidirectional 64b Multi  | 6.48Gbps    | 13.4Mpps    | 90.1%          |
| U3: Unidirectional 64b Single | 3.73Gbps    | 7.77Mpps    | 52.2%          |
| U4: Unidirectional 64b MPLS   | 3.32Gbps    | 6.91Mpps    | 46.4%          |
| B1: Bidirectional 1514b       | 12.9Gbps    | 1.07Mpps    | 65.6%          |
| B2: Bidirectional 64b Multi   | 6.08Gbps    | 12.7Mpps    | 42.7%          |
| B3: Bidirectional 64b Single  | 6.25Gbps    | 13.0Mpps    | 43.7%          |
| B4: Bidirectional 64b MPLS    | 3.26Gbps    | 6.79Mpps    | 22.8%          |

A few further observations:
1. ***U1***'s loadtest shows that the machine can sustain 10Gbps in one direction, while ***B1***
   shows that bidirectional loadtests are not yielding twice as much throughput. This is very
   likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the
   OCP NIC supports PCIe 5.0GT/s x8 (32Gbps).
1. ***U3***'s loadtest shows that one single CPU can do 7.77Mpps max, if it's the only CPU that is
   doing work. This is likely because if it's the only thread doing work, it gets to use the
   entire L2/L3 cache for itself.
1. ***U2***'s test shows that when multiple workers perform work, the throughput rises to
   13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is
   because the threads now need to share the CPU's modest L2/L3 cache.
1. ***B3***'s loadtest shows that two CPU threads together can do 6.50Mpps each (for a total of
   13.0Mpps), which I think is likely because each NIC now has to receive _and_ transmit packets.

If you're reading this and think you have an alternative explanation, do let me know!

### Mellanox Cx3 (10GbE)

When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA)
for a _list_ of packets, rather than one individual packet. It then brings these packets, called
_vectors_, through a directed acyclic graph inside of VPP. Each graph node does something specific to
the packets, for example in `ethernet-input`, the node checks what ethernet type each packet is (ARP,
IPv4, IPv6, MPLS, ...), and hands them off to the correct next node, such as `ip4-input` or `mpls-input`.
If VPP is idle, there may be only one or two packets in the list, which means every time the packets
go into a new node, a new chunk of code has to be loaded from working memory into the CPU's
instruction cache. Conversely, if there are many packets in the list, only the first packet may need
to pull things into the i-cache; the second through Nth packets will become cache hits and execute
_much_ faster. Moreover, some nodes in VPP make use of processor optimizations like _SIMD_ (single
instruction, multiple data), to save on clock cycles if the same operation needs to be executed
multiple times.

{{< image src="/assets/r86s/grafana-clocks.png" alt="Grafana Clocks" >}}

This graph shows the average CPU cycles per packet for each node. In the first three loadtests
(***U1***, ***U2*** and ***U3***), you can see four lines representing the VPP nodes `ip4-input`,
`ip4-lookup`, `ip4-load-balance` and `ip4-rewrite`. In the fourth loadtest ***U4***, you can see
only three nodes: `mpls-input`, `mpls-lookup`, and `ip4-mpls-label-disposition-pipe` (where the MPLS
label '16' is swapped for outgoing label '17').

It's clear to me that when VPP does not have many packets/sec to route (i.e. the ***U1*** loadtest), the
cost _per packet_ is actually quite high at around 200 CPU cycles per packet per node. But if I
slam the VPP instance with lots of packets/sec (i.e. the ***U3*** loadtest), VPP gets _much_ more
efficient at what it does. What used to take 200+ cycles per packet now only takes between 34-52
cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?!

And with that, the Mellanox Cx3 loadtest completes, and the results are in:

| ***Mellanox MCX342A-XCCN***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------------------------------|-------------|-------------|----------------|
| U1: Unidirectional 1514b              | 9.73Gbps    | 805kpps     | 99.7%          |
| U2: Unidirectional 64b Multi          | 1.11Gbps    | 2.30Mpps    | 15.5%          |
| U3: Unidirectional 64b Single         | 1.10Gbps    | 2.27Mpps    | 15.3%          |
| U4: Unidirectional 64b MPLS           | 1.10Gbps    | 2.27Mpps    | 15.3%          |
| B1: Bidirectional 1514b               | 18.7Gbps    | 1.53Mpps    | 94.9%          |
| B2: Bidirectional 64b Multi           | 1.54Gbps    | 2.29Mpps    | 7.69%          |
| B3: Bidirectional 64b Single          | 1.54Gbps    | 2.29Mpps    | 7.69%          |
| B4: Bidirectional 64b MPLS            | 1.54Gbps    | 2.29Mpps    | 7.69%          |

Here's something that I find strange though. VPP is clearly not saturated by these 64b loadtests. I
know this, because in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a
bidirectional test, yet with this Mellanox Cx3 card, no matter if I do one direction or both
directions, the max packets/sec tops out at 2.3Mpps only -- that's nearly an order of magnitude lower.

Looking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from
Port 5/0/1) are not very busy at all. If a VPP worker thread is saturated, this typically shows as
a vectors/call of 256.00 and 100% of CPU cycles consumed. But here, that's not the case at all, and
most time is spent in DPDK waiting for traffic:

```
Thread 1 vpp_wk_0 (lcore 1)
Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15
  vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends       Clocks   Vectors/Call
FiftySixGigabitEthernet5/0/1-o    active     15949560      35929200            0       8.39e1           2.25
FiftySixGigabitEthernet5/0/1-t    active     15949560      35929200            0       2.59e2           2.25
dpdk-input                       polling     36250611      35929200            0       6.55e2            .99
ethernet-input                    active     15949560      35929200            0       2.69e2           2.25
ip4-input-no-checksum             active     15949560      35929200            0       1.01e2           2.25
ip4-load-balance                  active     15949560      35929200            0       7.64e1           2.25
ip4-lookup                        active     15949560      35929200            0       9.26e1           2.25
ip4-rewrite                       active     15949560      35929200            0       9.28e1           2.25
unix-epoll-input                 polling        35367             0            0       1.29e3           0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38
  vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends       Clocks   Vectors/Call
FiftySixGigabitEthernet5/0/0-o    active     14845221      35913927            0       8.66e1           2.42
FiftySixGigabitEthernet5/0/0-t    active     14845221      35913927            0       2.72e2           2.42
dpdk-input                       polling     23114538      35913927            0       6.99e2           1.55
ethernet-input                    active     14845221      35913927            0       2.65e2           2.42
ip4-input-no-checksum             active     14845221      35913927            0       9.73e1           2.42
ip4-load-balance                  active     14845220      35913923            0       7.17e1           2.42
ip4-lookup                        active     14845221      35913927            0       9.03e1           2.42
ip4-rewrite                       active     14845221      35913927            0       8.97e1           2.42
unix-epoll-input                 polling        22551             0            0       1.37e3           0.00
```

{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}

I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like
small packets? I've read online that Mellanox cards do some form of message compression on the PCI
bus, which could perhaps be turned off. I don't know, but I don't like it!

### Mellanox Cx5 (25GbE)

VPP has a few _polling_ nodes, which are pieces of code that execute back-to-back in a tight
execution loop. A classic example of a _polling_ node is a _Poll Mode Driver_ from DPDK: this will
ask the network cards if they have any packets, and if so, marshal them through the directed graph
of VPP. As soon as that's done, the node will immediately ask again. If there is no work to do, this
turns into a tight loop with DPDK continuously asking for work. There is however another, lesser
known, _polling_ node: `unix-epoll-input`. This node services a local pool of file descriptors, like
the _Linux Control Plane_ netlink socket for example, or the clients attached to the Statistics
segment, CLI or API. You can see the open files with `show unix files`.

This design explains why the CPU load of a typical DPDK application is 100% of each worker thread. As
an aside, you can ask the PMD to start off in _interrupt_ mode, and only after a certain load switch
seamlessly to _polling_ mode. Take a look at `set interface rx-mode` on how to change from _polling_
to _interrupt_ or _adaptive_ modes. For performance reasons, I always leave the node in _polling_
mode (the default in VPP).
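
For completeness, switching a queue away from polling looks something like this in the VPP CLI
(shown here with the `e0`/`e1` names from my startup.conf; I did not use this for the loadtests):

```
set interface rx-mode e0 adaptive
set interface rx-mode e1 interrupt
show interface rx-placement
```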

{{< image src="/assets/r86s/grafana-cpu.png" alt="Grafana CPU" >}}

The stats segment shows how many clock cycles are being spent in each call of each node. It also
knows how often nodes are called. Considering the `unix-epoll-input` and `dpdk-input` nodes will
perform what is essentially a tight loop, the CPU should always add up to 100%. I found that one
cool way to show how busy a VPP instance really is, is to look over all CPU threads, and sort
through the fraction of time spent in each node:

* ***Input Nodes***: are those which handle the receive path from DPDK and into the directed graph
  for routing -- for example `ethernet-input`, then `ip4-input` through to `ip4-lookup` and finally
  `ip4-rewrite`. This is where VPP usually spends most of its CPU cycles.
* ***Output Nodes***: are those which handle the transmit path into DPDK. You'll see these are
  nodes whose name ends in `-output` or `-tx`. You can also see that in ***U2***, there are only
  two such nodes consuming CPU, while in ***B2*** there are four nodes (because two interfaces are
  transmitting!)
* ***epoll***: the _polling_ node called `unix-epoll-input` depicted in <span
  style='color:brown;font-weight:bold'>brown</span> in this graph.
* ***dpdk***: the _polling_ node called `dpdk-input` depicted in <span
  style='color:green;font-weight:bold'>green</span> in this graph.

If there is no work to do, as was the case at around 20:30 in the graph above, the _dpdk_ and
_epoll_ nodes are the only two that are consuming CPU. If there's lots of work to do, as was the
case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest
between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving
the _dpdk_ and _epoll_ nodes until an equilibrium is achieved. This is how I know the VPP process
is the bottleneck and not, for example, the PCI bus.

I let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table
for the Mellanox Cx5:

| ***Mellanox MCX542_ACAT***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|--------------------------------------|-------------|-------------|----------------|
| U1: Unidirectional 1514b             | 24.2Gbps    | 2.01Mpps    | 98.6%          |
| U2: Unidirectional 64b Multi         | 7.43Gbps    | 15.5Mpps    | 41.6%          |
| U3: Unidirectional 64b Single        | 3.52Gbps    | 7.34Mpps    | 19.7%          |
| U4: Unidirectional 64b MPLS          | 7.34Gbps    | 15.3Mpps    | 46.4%          |
| B1: Bidirectional 1514b              | 24.9Gbps    | 2.06Mpps    | 50.4%          |
| B2: Bidirectional 64b Multi          | 6.58Gbps    | 13.7Mpps    | 18.4%          |
| B3: Bidirectional 64b Single         | 3.15Gbps    | 6.55Mpps    | 8.81%          |
| B4: Bidirectional 64b MPLS           | 6.55Gbps    | 13.6Mpps    | 18.3%          |

Some observations:
1. This Mellanox Cx5 runs quite a bit hotter than the other two cards. It's a PCIe v3.0 device, which means
   that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case
   you're wondering, this is 128b/130b encoding on 8.0GT/s x4).
1. It easily saturates 25Gbit in one direction with big packets in ***U1***, but as soon as smaller
   packets are offered, each worker thread tops out at 7.34Mpps or so in ***U2***.
1. When testing in both directions, each thread can do about 6.55Mpps or so in ***B2***. Similar to
   the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple
   threads), and RX/TX simultaneously (when doing bidirectional tests).
1. MPLS is a lot faster -- nearly double, thanks to the use of multiple threads. I think this is
   because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner
   payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not.

## Summary and closing thoughts

There's a lot to say about these OCP cards. The Intel is cheap, the Mellanox Cx3 is a bit
quirky with its VPP enumeration, and the Mellanox Cx5 is a bit more expensive (and draws a fair bit
more power, coming in at 20W) but does 25Gbit reasonably well, so it's pretty difficult to make a solid
recommendation. What I find interesting is the very low limit in packets/sec on 64b packets coming
from the Cx3, while at the same time there seems to be an added benefit in MPLS hashing that the
other two cards do not have.

All things considered, I think I would recommend the Intel X520-DA2 (based on the _Niantic_ chip,
Intel 82599ES, total machine coming in at 17W). It seems like it pairs best with the available CPU
on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)

Here are a few files I gathered along the way, in case they are useful:

* [[LSCPU](/assets/r86s/lscpu.txt)] - [[Likwid Topology](/assets/r86s/likwid-topology.txt)] -
  [[DMI Decode](/assets/r86s/dmidecode.txt)] - [[LSBLK](/assets/r86s/lsblk.txt)]
* Mellanox Cx341: [[dmesg](/assets/r86s/dmesg-cx3.txt)] - [[LSPCI](/assets/r86s/lspci-cx3.txt)] -
  [[LSHW](/assets/r86s/lshw-cx3.txt)] - [[VPP Patch](/assets/r86s/vpp-cx3.patch)]
* Mellanox Cx542: [[dmesg](/assets/r86s/dmesg-cx5.txt)] - [[LSPCI](/assets/r86s/lspci-cx5.txt)] -
  [[LSHW](/assets/r86s/lshw-cx5.txt)]
* Intel X520-DA2: [[dmesg](/assets/r86s/dmesg-x520.txt)] - [[LSPCI](/assets/r86s/lspci-x520.txt)] -
  [[LSHW](/assets/r86s/lshw-x520.txt)]
* VPP Configs: [[startup.conf](/assets/r86s/vpp/startup.conf)] - [[L2 Config](/assets/r86s/vpp/config/l2.vpp)] -
  [[L3 Config](/assets/r86s/vpp/config/l3.vpp)] - [[MPLS Config](/assets/r86s/vpp/config/mpls.vpp)]