---
date: "2024-08-03T10:51:23Z"
title: 'Review: Gowin 1U 2x25G (Alder Lake - N305)'
aliases:
- /s/articles/2024/08/03/gowin.html
---

# Introduction

{{< image float="right" src="/assets/gowin-n305/gowin-logo.png" alt="Gowin logo" >}}

Last month, I took a good look at the Gowin R86S, based on the Jasper Lake (N6005) CPU
[[ref](https://www.gowinfanless.com/products/network-device/r86s-firewall-router/gw-r86s-u-series)],
which is a really neat little 10G (and, if you fiddle with it a little bit, 25G!) router that runs
off of USB-C power and can be rack mounted if you print a bracket. Check out my findings in this
[[article]({{< ref "2024-07-05-r86s" >}})].

David from Gowin reached out and asked me if I was willing to also take a look at their Alder Lake
(N305) based machine, which comes in a 19" rack mountable chassis, runs off of 110V/220V AC mains
power, and also sports a 2x25G ConnectX-4 network card. Why not! For critical readers: David sent me
this machine, but made no attempt to influence this article.

### Hardware Specs

{{< image width="500px" float="right" src="/assets/gowin-n305/case.jpg" alt="Gowin overview" >}}

There are a few differences between this 19" model and the compact mini-pc R86S. The most obvious
difference is the form factor. The R86S is super compact and not inherently rack mountable,
although I 3D printed a bracket for it. Looking inside, the motherboard is mostly obscured by a large
cooling block with fins that are flush with the top plate. There are five copper ports in the front:
2x Intel i226-V (these are 2.5Gbit) and 3x Intel i210 (these are 1Gbit), one of which offers PoE,
which can be very handy to power a camera or WiFi access point. A nice touch.

The Gowin server comes with an OCP v2.0 port, just like the R86S does. There's a custom bracket with
a ribbon cable to the motherboard, and the bracket houses a Mellanox ConnectX-4 Lx 2x25Gbit
network card.

### A look inside

{{< image width="350px" float="right" src="/assets/gowin-n305/mobo.jpg" alt="Gowin mobo" >}}

The machine comes with an Intel i3-N305 (Alder Lake) CPU running at a max clock of 3GHz and 4x8GB of
LPDDR5 memory at 4800MT/s -- and considering the Alder Lake can make use of 4-channel memory, this
thing should be plenty fast. The memory is soldered to the board, though, so there's no option of
expanding or changing the memory after buying the unit.

Using `likwid-topology`, I determine that the 8-core CPU has no hyperthreads, but just straight up 8
cores with 32kB of L1 cache each, two times 2MB of L2 cache (one bank shared between cores 0-3, and
another bank shared between cores 4-7), and 6MB of L3 cache shared between all 8 cores. This is
again a step up from the Jasper Lake CPU, and should make VPP run a little bit faster.

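If you'd like to reproduce this topology check, a minimal sketch (assuming the `likwid` and
`util-linux` packages are installed):

```
$ likwid-topology -c      # print the cache hierarchy per core
$ lscpu --caches          # cross-check the L1/L2/L3 sizes as Linux reports them
```
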
What I find a nice touch is that Gowin has shipped this board with a 128GB eMMC flash disk, which
appears in Linux as `/dev/mmcblk0` and can be used to install an OS. However, there are also two
M.2 2280 NVMe slots, one mSATA slot and two additional SATA slots with 4-pin power. On the
side of the chassis is a clever bracket that holds three 2.5" SSDs in a staircase configuration.
That's quite a lot of storage options, and given the CPU has some oomph, this little one could
realistically be a NAS, although I'd prefer it to be a VPP router!

{{< image width="350px" float="right" src="/assets/gowin-n305/ocp-ssd.jpg" alt="Gowin ocp-ssd" >}}

The copper RJ45 ports are all on the motherboard, and there's an OCP breakout port that fits any OCP
v2.0 network card. Gowin shipped it with a ConnectX-4 Lx, but since I had a ConnectX-5 EN, I will
take a look at performance with both cards. One critical observation, as with the Jasper Lake R86S,
is that there are only 4 PCIe v3.0 lanes routed to the OCP slot, which means that the spiffy x8
network interfaces (both the Cx4 and the Cx5 I have here) will run at half speed. Bummer!

The power supply is a 100-240V switching PSU with about 150W of power available. When running idle,
with one 1TB NVMe drive, I measure 38.2W on the 220V side. When running VPP at full load, I measure
47.5W in total. That's totally respectable for a 2x 25G + 2x 2.5G + 3x 1G VPP router.

I've added some pictures to a [[Google Photos](https://photos.app.goo.gl/rbd9xJBUUcnCgW7v9)] album,
if you'd like to take a look.

## VPP Loadtest: RDMA versus DPDK

You (hopefully) came here to read about VPP stuff. For years now, I have been curious about the
performance and functional differences in VPP between using DPDK and the native RDMA driver that
Mellanox network cards support. In this article, I'll do four loadtests, with the stock Mellanox Cx4
that comes with the Gowin server, and with the Mellanox Cx5 card that I had bought for the R86S.
I'll take a look at the differences between DPDK on the one hand and RDMA on the other. This will
yield, for me at least, a better understanding of the differences. Spoiler: there are not many!

## DPDK

{{< image float="right" src="/assets/gowin-n305/dpdk.png" alt="DPDK logo" >}}

The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux
Foundation. It provides a set of data plane libraries and network interface controller polling-mode
drivers for offloading ethernet packet processing from the operating system kernel to processes
running in user space. This offloading achieves higher computing efficiency and higher packet
throughput than is possible using the interrupt-driven processing provided in the kernel.

You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Data_Plane_Development_Kit)]
or on the [[DPDK Homepage](https://dpdk.org/)]. VPP uses DPDK as one of the (more popular) drivers
for network card interaction.

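For reference, when VPP drives a NIC through DPDK, the interfaces are declared in `startup.conf`. A
minimal sketch of what the `dpdk` stanza could look like for this machine -- the PCI addresses are
the ones this article uses below, the rest is illustrative; my actual `startup.conf` is linked at
the end of the article:

```
dpdk {
  dev 0000:0e:00.0 { name xxv0 num-rx-queues 3 }
  dev 0000:0e:00.1 { name xxv1 num-rx-queues 3 }
}
```
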
### DPDK: ConnectX-4 Lx

This is the OCP network card that came with the Gowin server. It identifies in Linux as:

```
0e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0e:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
```

Albeit with an important warning in `dmesg` about the lack of PCIe lanes:

```
[3.704174] pci 0000:0e:00.0: [15b3:1015] type 00 class 0x020000
[3.708154] pci 0000:0e:00.0: reg 0x10: [mem 0x60e2000000-0x60e3ffffff 64bit pref]
[3.716221] pci 0000:0e:00.0: reg 0x30: [mem 0x80d00000-0x80dfffff pref]
[3.724079] pci 0000:0e:00.0: Max Payload Size set to 256 (was 128, max 512)
[3.732678] pci 0000:0e:00.0: PME# supported from D3cold
[3.736296] pci 0000:0e:00.0: reg 0x1a4: [mem 0x60e4800000-0x60e48fffff 64bit pref]
[3.756916] pci 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link
           at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
```

With PCIe v3.0's 128b/130b encoding overhead, that means the card will have (128/130) * 32 = 31.508
Gbps available, and I'm actually not quite sure why the kernel claims 31.504G in the log message.
Anyway, the card itself works just fine at this speed, and is immediately detected in DPDK while
continuing to use the `mlx5_core` driver. This would be a bit different with Intel based cards, as
there the driver has to be rebound to `vfio_pci` or `uio_pci_generic`. Here, the NIC itself remains
visible (and usable!) in Linux, which is kind of neat.

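To illustrate the difference: for an Intel card one would typically rebind the kernel driver before
DPDK can take over, along these lines (the PCI address here is just a hypothetical example, not a
device in this machine):

```
# Hypothetical Intel NIC at 0000:03:00.0 -- not needed for the Mellanox cards in this article.
$ modprobe vfio-pci
$ dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
$ dpdk-devbind.py --status | grep 03:00.0
```
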
I do my standard set of eight loadtests: {unidirectional,bidirectional} x {1514b, 64b multiflow, 64b
singleflow, MPLS}. This teaches me a lot about how the NIC uses flow hashing, and what its maximum
performance is. Without further ado, here are the results:

| Loadtest: Gowin CX4 DPDK         | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional             | 25.00 Gbps  | 2.04 Mpps   | 100.2 %   |
| 64b-unidirectional               | 7.43 Gbps   | 11.05 Mpps  | 29.7 %    |
| 64b-single-unidirectional        | 3.09 Gbps   | 4.59 Mpps   | 12.4 %    |
| 64b-mpls-unidirectional          | 7.34 Gbps   | 10.93 Mpps  | 29.4 %    |
| 1514b-bidirectional              | 22.63 Gbps  | 1.84 Mpps   | 45.2 %    |
| 64b-bidirectional                | 7.42 Gbps   | 11.04 Mpps  | 14.8 %    |
| 64b-single-bidirectional         | 5.33 Gbps   | 7.93 Mpps   | 10.7 %    |
| 64b-mpls-bidirectional           | 7.36 Gbps   | 10.96 Mpps  | 14.8 %    |

Some observations:

* In the large packet department, the NIC easily saturates the port speed in _unidirectional_, and
  saturates the PCIe bus (x4) in _bidirectional_ forwarding. I'm surprised that the bidirectional
  forwarding capacity is a bit lower (1.84Mpps versus 2.04Mpps).
* The NIC is using three queues, and the scaling from single flow (which can only use one queue,
  and thus one CPU thread) to three receive queues is not exactly linear (4.59Mpps vs 11.05Mpps).
* The MPLS performance is higher than single flow, which I think means that the NIC is capable of
  hashing the packets based on the _inner packet_. Otherwise, when all packets carry the same MPLS
  label, the Cx3 and other NICs tend to leverage only one receive queue.

I'm very curious how this NIC stacks up between DPDK and RDMA -- read on below for my results!

### DPDK: ConnectX-5 EN

I swap the card out of its OCP bay and replace it with a ConnectX-5 EN that I have from when I
tested the [[R86S]({{< ref "2024-07-05-r86s" >}})]. It identifies as:

```
0e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
```

And similar to the ConnectX-4, this card also complains about PCIe bandwidth:

```
[6.478898] mlx5_core 0000:0e:00.0: firmware version: 16.25.4062
[6.485393] mlx5_core 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link
           at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[6.816156] mlx5_core 0000:0e:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[6.841005] mlx5_core 0000:0e:00.0: Port module event: module 0, Cable plugged
[7.023602] mlx5_core 0000:0e:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[7.177744] mlx5_core 0000:0e:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
```

With that said, the loadtests are quite a bit more favorable for the newer ConnectX-5:

| Loadtest: Gowin CX5 DPDK         | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional             | 24.98 Gbps  | 2.04 Mpps   | 99.7 %    |
| 64b-unidirectional               | 10.71 Gbps  | 15.93 Mpps  | 42.8 %    |
| 64b-single-unidirectional        | 4.44 Gbps   | 6.61 Mpps   | 17.8 %    |
| 64b-mpls-unidirectional          | 10.36 Gbps  | 15.42 Mpps  | 41.5 %    |
| 1514b-bidirectional              | 24.70 Gbps  | 2.01 Mpps   | 49.4 %    |
| 64b-bidirectional                | 14.58 Gbps  | 21.69 Mpps  | 29.1 %    |
| 64b-single-bidirectional         | 8.38 Gbps   | 12.47 Mpps  | 16.8 %    |
| 64b-mpls-bidirectional           | 14.50 Gbps  | 21.58 Mpps  | 29.1 %    |

Some observations:

* The NIC also saturates 25G in one direction with large packets, and saturates the PCIe bus when
  pushing in both directions.
* Single queue / thread operation at 6.61Mpps is a fair bit higher than the Cx4 (which is 4.59Mpps).
* Multiple threads scale almost linearly, from 6.61Mpps with one queue to 15.93Mpps with three
  queues. That's respectable!
* Bidirectional small packet performance is pretty great at 21.69Mpps, more than double that of
  the Cx4 (which is 11.04Mpps).
* MPLS rocks! The NIC forwards 21.58Mpps of MPLS traffic.

One thing I should note is that at this point, the CPUs are not fully saturated. Looking at
Prometheus/Grafana for this set of loadtests:

{{< image src="/assets/gowin-n305/cx5-cpu.png" alt="Cx5 CPU" >}}

What I find interesting is that in no case did any CPU thread run to 100% utilization. In the 64b
single flow loadtests (from 14:00-14:10 and from 15:05-15:15), the CPU threads definitely got close,
but they did not clip -- which does lead me to believe that the NIC (or the PCIe bus!) is the
bottleneck.

By the way, the bidirectional single flow 64b loadtest shows two threads that each have an overall
slightly _lower_ utilization (63%) than the unidirectional single flow 64b loadtest (at 78.5%). I
think this can be explained by the two threads being able to use and re-use each other's cache
lines.

***Conclusion***: ConnectX-5 performs significantly better than ConnectX-4 with DPDK.

## RDMA

{{< image float="right" src="/assets/gowin-n305/rdma.png" alt="RDMA artwork" >}}

RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire
directly to application memory or from application memory directly to the wire, eliminating the need
to copy data between application memory and the data buffers in the operating system. Such transfers
require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel
with other system operations. This reduces latency in message transfer.

You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Remote_direct_memory_access)].
VPP uses RDMA in a clever way, relying on the Linux rdma-core library (libibverbs) to create a
custom userspace poll-mode driver, specifically for Ethernet packets. Despite using the RDMA APIs,
this is not about RDMA (no InfiniBand, no RoCE, no iWARP), just pure traditional Ethernet packets.
Many VPP developers recommend and prefer RDMA for Mellanox devices. I myself have been more
comfortable with DPDK. But, now is the time to _FAFO_.

### RDMA: ConnectX-4 Lx

Considering I used three RX queues for DPDK, I now instruct VPP to use three receive queues for RDMA
as well. I remove the `dpdk_plugin.so` from `startup.conf`, although I could also have kept the DPDK
plugin running (to drive the 1.0G and 2.5G ports!) and de-selected the `0000:0e:00.0` and
`0000:0e:00.1` PCI entries, so that the RDMA driver can grab them.

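For the curious, a sketch of what those two options could look like in `startup.conf` -- treat this
as an illustration (pick one of the two), not my exact configuration, which is linked at the end of
the article:

```
plugins {
  # Option 1: drop DPDK entirely and let the rdma driver own the 25G ports.
  plugin dpdk_plugin.so { disable }
}

# Option 2: keep DPDK for the copper ports, but skip the Mellanox devices
# (vendor:device 15b3:1015, as seen in dmesg above) so that
# 'create interface rdma' can claim them later.
dpdk {
  blacklist 15b3:1015
}
```
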
The VPP interface creation now looks like this:

```
vpp# create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 512 tx-queue-size 512
     num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026
vpp# create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 512 tx-queue-size 512
     num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026
vpp# set int mac address xxv0 02:fe:4a:ce:c2:fc
vpp# set int mac address xxv1 02:fe:4e:f5:82:e7
```

I realize something pretty cool - the RDMA interface gets an ephemeral (randomly generated) MAC
address, while the main network card in Linux stays available. The NIC internally has a hardware
filter for the RDMA bound MAC address and hands that traffic to VPP -- the implication is that the
25G NICs can *also* be used in Linux itself. That's slick.

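A quick way to see this in action -- a sketch, using the interface names from above:

```
$ ip -br link show enp14s0f0np0          # the Linux netdev keeps its own (hardware) MAC
$ vppctl show hardware-interfaces xxv0   # VPP's xxv0 shows the ephemeral 02:fe:... address
```
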
Performance wise:

| Loadtest: Gowin CX4 with RDMA    | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional             | 25.01 Gbps  | 2.04 Mpps   | 100.3 %   |
| 64b-unidirectional               | 12.32 Gbps  | 18.34 Mpps  | 49.1 %    |
| 64b-single-unidirectional        | 6.21 Gbps   | 9.24 Mpps   | 24.8 %    |
| 64b-mpls-unidirectional          | 11.95 Gbps  | 17.78 Mpps  | 47.8 %    |
| 1514b-bidirectional              | 26.24 Gbps  | 2.14 Mpps   | 52.5 %    |
| 64b-bidirectional                | 14.94 Gbps  | 22.23 Mpps  | 29.9 %    |
| 64b-single-bidirectional         | 11.53 Gbps  | 17.16 Mpps  | 23.1 %    |
| 64b-mpls-bidirectional           | 14.99 Gbps  | 22.30 Mpps  | 30.0 %    |

Some thoughts:

* The RDMA driver is significantly _faster_ than DPDK in this configuration. Hah!
* 1514b packets are fine in both directions; RDMA slightly outperforms DPDK in the bidirectional test.
* 64b is massively faster:
  * Unidirectional multiflow: RDMA 18.34Mpps, DPDK 11.05Mpps
  * Bidirectional multiflow: RDMA 22.23Mpps, DPDK 11.04Mpps
  * Bidirectional MPLS: RDMA 22.30Mpps, DPDK 10.96Mpps

***Conclusion***: I would say, roughly speaking, that RDMA outperforms DPDK on the Cx4 by a factor
of two. That's really cool, especially because ConnectX-4 network cards can be found very cheaply
these days.

### RDMA: ConnectX-5 EN

Well then, what about the newer Mellanox ConnectX-5 card? Something surprising happens when I boot
the machine and start the exact same configuration as with the Cx4: the loadtest results almost
invariably suck:

| Loadtest: Gowin CX5 with RDMA    | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional             | 24.95 Gbps  | 2.03 Mpps   | 99.6 %    |
| 64b-unidirectional               | 6.19 Gbps   | 9.22 Mpps   | 24.8 %    |
| 64b-single-unidirectional        | 3.27 Gbps   | 4.87 Mpps   | 13.1 %    |
| 64b-mpls-unidirectional          | 6.18 Gbps   | 9.20 Mpps   | 24.7 %    |
| 1514b-bidirectional              | 24.59 Gbps  | 2.00 Mpps   | 49.2 %    |
| 64b-bidirectional                | 8.84 Gbps   | 13.15 Mpps  | 17.7 %    |
| 64b-single-bidirectional         | 5.57 Gbps   | 8.29 Mpps   | 11.1 %    |
| 64b-mpls-bidirectional           | 8.77 Gbps   | 13.05 Mpps  | 17.5 %    |

Yikes! The Cx5 in its default mode can still saturate the 1514b loadtests, but turns in single-digit
Gbps numbers for almost all other loadtest types. I'm also surprised that the single flow loadtest
clocks in at only 4.87Mpps, which is about the same speed I saw with the ConnectX-4 using DPDK. This
does not look good at all, and honestly, I don't believe it.

So I start fiddling with settings.

#### ConnectX-5 EN: Tuning Parameters

There are a few things I found that might speed up processing in the ConnectX network card:

1. Allowing for larger PCIe read requests - by default 512b, which I can raise to 1k, 2k or even 4k.
   `setpci -s 0e:00.0 68.w` will return some hex number ABCD, where the A nibble stands for the max
   read request size: 0=128b, 1=256b, 2=512b, 3=1k, 4=2k, 5=4k. I can set the value by writing
   `setpci -s 0e:00.0 68.w=3BCD`, which immediately speeds up the loadtests!
1. Mellanox recommends turning on CQE compression, to allow the PCIe messages to be aggressively
   compressed, saving bandwidth. This helps specifically with _smaller_ packets, as the PCIe message
   overhead really starts to matter. `mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1` and reboot.
1. For MPLS, the Cx5 can do flow matching on the inner packet (rather than hashing all packets to
   the same queue based on the MPLS label) -- `mlxconfig -d 0e:00.0 set
   FLEX_PARSER_PROFILE_ENABLE=1` and reboot.
1. Likely the number of receive queues matters, and it can be set in the `create interface rdma`
   command.

I notice that CQE_COMPRESSION and FLEX_PARSER_PROFILE_ENABLE help in all cases, so I set them and
reboot. The PCIe max read request resizing also helps, specifically with smaller packets, so I set
that too, in `/etc/rc.local`. That leaves the fourth variable: the receive queue count.

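For completeness, a sketch of what that `/etc/rc.local` snippet could look like (assuming, as above,
that the two 25G ports sit at `0e:00.0` and `0e:00.1`; the lower three nibbles of the register are
preserved as-is):

```
#!/bin/sh
# Raise the PCIe Max Read Request size to 1k (first nibble '3') on both 25G ports at boot.
for dev in 0e:00.0 0e:00.1; do
  cur=$(setpci -s $dev 68.w)      # e.g. '2936' -- the first nibble is the max read request size
  setpci -s $dev 68.w=3${cur#?}   # keep the lower three nibbles, set the first one to '3'
done
exit 0
```
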
Here's a comparison that, to me at least, was surprising. With three receive queues, and thus three
CPU threads each receiving 4.7Mpps and sending 3.1Mpps, performance looked like this:

```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026

$ vppctl show run | grep vector\ rates | grep -v in\ 0
  vector rates in 4.7586e6, out 3.2259e6, drop 3.7335e2, punt 0.0000e0
  vector rates in 4.9881e6, out 3.2206e6, drop 3.8344e2, punt 0.0000e0
  vector rates in 5.0136e6, out 3.2169e6, drop 3.7335e2, punt 0.0000e0
```

This is fishy - why is the inbound rate so much higher than the outbound rate? The behavior is
consistent in multi-queue setups. If I create two queues, it's 8.45Mpps in and 7.98Mpps out:

```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026

$ vppctl show run | grep vector\ rates | grep -v in\ 0
  vector rates in 8.4533e6, out 7.9804e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 8.4517e6, out 7.9798e6, drop 0.0000e0, punt 0.0000e0
```

And when I create only one queue, the same pattern appears:

```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096
  num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026

$ vppctl show run | grep vector\ rates | grep -v in\ 0
  vector rates in 1.2082e7, out 9.3865e6, drop 0.0000e0, punt 0.0000e0
```

But now that I've scaled down to only one queue (and thus one CPU thread doing all the work), I
manage to find a clue in the `show runtime` output:

```
Thread 1 vpp_wk_0 (lcore 1)
Time 321.1, 10 sec internal node vector rate 256.00 loops/sec 46813.09
  vector rates in 1.2392e7, out 9.4015e6, drop 0.0000e0, punt 1.5571e-2
             Name             State      Calls      Vectors    Suspends   Clocks  Vectors/Call
ethernet-input                active   15543357   3979099392      0       2.79e1     256.00
ip4-input-no-checksum         active   15543352   3979098112      0       1.26e1     256.00
ip4-load-balance              active   15543357   3979099387      0       9.17e0     255.99
ip4-lookup                    active   15543357   3979099387      0       1.43e1     255.99
ip4-rewrite                   active   15543357   3979099387      0       1.69e1     255.99
rdma-input                    polling  15543357   3979099392      0       2.57e1     256.00
xxv1-output                   active   15543357   3979099387      0       5.03e0     255.99
xxv1-tx                       active   15543357   3018807035      0       4.35e1     194.22
```

It takes a bit of practice to spot this, but see how `xxv1-output` is running at 256 vectors/call,
while `xxv1-tx` is running at only 194.22 vectors/call? That means that VPP is dutifully handling
the whole packet, but when it is handed off to RDMA to marshall onto the hardware, it's getting
lost! And indeed, this is corroborated by `show errors`:

```
$ vppctl show err
     Count       Node                 Reason                    Severity
      3334       null-node            blackholed packets        error
      7421       ip4-arp              ARP requests throttled    info
         3       ip4-arp              ARP requests sent         info
1454511616       xxv1-tx              no free tx slots          error
        16       null-node            blackholed packets        error
```

Wow, well over a billion packets have been routed by VPP but then had to be discarded because the
RDMA output could not keep up. Ouch.

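When repeating these loadtests, it helps to reset the counters between runs and keep an eye on the
offending counter live. A sketch, using standard `vppctl` commands:

```
$ vppctl clear errors && vppctl clear runtime      # reset counters before a fresh run
$ watch -n1 "vppctl show errors | grep 'no free tx slots'"
```
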
Compare the previous CPU utilization graph (from the Cx5/DPDK loadtest) with this Cx5/RDMA/1-RXQ
loadtest:

{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Here I can clearly see that the one CPU thread (in yellow, for unidirectional) and the two CPU
threads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
when VPP is completely pegged, it is receiving 12.4Mpps _per core_, but only manages to get RDMA to
send 9.40Mpps of those on the wire. The performance further deteriorates when multiple receive
queues are in play. Note: 12.4Mpps is pretty great for these CPU threads.

***Conclusion***: A single queue RDMA based Cx5 will allow for about 9Mpps per interface, which is a
little bit better than DPDK performance; but Cx4 and Cx5 performance are not too far apart.

## Summary and closing thoughts

Looking at the RDMA results for both the Cx4 and Cx5, using only one thread gives fair performance
with very low CPU cost per port -- however, I could not manage to get rid of the `no free tx slots`
errors, and VPP can consume / process / forward more packets than RDMA is willing to marshall out on
the wire, which is disappointing.

That said, both RDMA and DPDK reach line rate at 25G unidirectional with sufficiently large packets,
and for small packets, the machine can realistically handle roughly 9Mpps per CPU thread.
Considering the CPU has 8 threads -- of which 6 are usable by VPP -- the machine has more CPU than
it needs to drive the NICs. It should be a really great router at 10Gbps traffic rates, and a very
fair router at 25Gbps, with either RDMA or DPDK.

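As an illustration of that core split, a `startup.conf` CPU stanza could look something like this --
a sketch only; my actual configuration is linked below:

```
cpu {
  main-core 1              # VPP main thread on core 1, leaving core 0 for Linux housekeeping
  corelist-workers 2-7     # six worker threads to drive the NIC queues
}
```
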
Here are a few files I gathered along the way, in case they are useful:

* [[LSCPU](/assets/gowin-n305/lscpu.txt)] - [[Likwid Topology](/assets/gowin-n305/likwid-topology.txt)] -
  [[DMI Decode](/assets/gowin-n305/dmidecode.txt)] - [[LSBLK](/assets/gowin-n305/lsblk.txt)]
* Mellanox MCX4421A-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx4.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx4.txt)] -
  [[LSHW](/assets/gowin-n305/lshw-cx4.txt)]
* Mellanox MCX542B-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx5.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx5.txt)] -
  [[LSHW](/assets/gowin-n305/lshw-cx5.txt)]
* VPP Configs: [[startup.conf](/assets/gowin-n305/vpp/startup.conf)] - [[L2 Config](/assets/gowin-n305/vpp/config/l2.vpp)] -
  [[L3 Config](/assets/gowin-n305/vpp/config/l3.vpp)] - [[MPLS Config](/assets/gowin-n305/vpp/config/mpls.vpp)]