---
date: "2024-02-17T12:17:54Z"
title: VPP on FreeBSD - Part 2
aliases:
- /s/articles/2024/02/17/vpp-freebsd-2.html
---

# About this series

{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}}

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar, as many of the approaches
are shared between the two. Over the years, folks have regularly asked me "What about BSD?", and to
my surprise, late last year I read an announcement from the _FreeBSD Foundation_
[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back
over 2023 and forward to 2024:

> ***Porting the Vector Packet Processor to FreeBSD***
>
> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
> that provides fast packet processing suitable for software-defined networking and network function
> virtualization applications. VPP aims to optimize packet processing through vectorized operations
> and parallelism, making it well-suited for high-speed networking applications. In November of this
> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network
> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
> experience, improving hardware support on arm64 platforms, and adding support for low power idle
> on Intel and arm64 hardware.

In my first [[article]({{< ref "2024-02-10-vpp-freebsd-1" >}})], I wrote a sort of a _hello world_
by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
performance numbers.

## Comparing implementations

FreeBSD has an extensive network stack, including regular _kernel_ based functionality such as
routing, filtering and bridging; a faster _netmap_ based datapath, including some userspace
utilities like a _netmap_ bridge; and of course completely userspace based dataplanes, such as the
VPP project that I'm working on here. Last week, I learned that VPP has a _netmap_ driver, and from
previous travels I am already quite familiar with its _DPDK_ based forwarding. I decide to do a
baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
[[article]({{< ref "2024-02-10-vpp-freebsd-1" >}})] for details on the setup.

The loadtests use a common set of configurations, built on Cisco T-Rex's default benchmark profile
called `bench.py`:

1. **var2-1514b**: Large packets, multiple flows with modulating source and destination IPv4
   addresses, often called an 'iperf test', with packets of 1514 bytes.
1. **var2-imix**: Mixed packets, multiple flows, often called an 'imix test', which includes a
   bunch of 64b, 390b and 1514b packets.
1. **var2-64b**: Small packets, still multiple flows, 64 bytes, which allows for multiple receive
   queues and kernel or application threads.
1. **64b**: Small packets, but now single flow, often called a 'linerate test', with a packet size
   of 64 bytes, limiting the test to one receive queue.

Each of these four loadtests can run either unidirectionally (port0 -> port1) or bidirectionally
(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes, and each
kicked off from the T-Rex console as sketched below. I put the kettle on and get underway.
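
For illustration, here is roughly how one of these runs is kicked off from the T-Rex interactive
console. Consider this a sketch from memory: the `-t` tunables correspond to the `vm=var2,size=...`
labels in the result tables below, but the exact flag syntax may differ between T-Rex versions:

```
## Unidirectional: 64 byte packets, multiple flows, transmitting on port 0 only
trex>start -f stl/bench.py -t vm=var2,size=64 -m 100% -p 0

## Bidirectional: imix, multiple flows, transmitting on both ports
trex>start -f stl/bench.py -t vm=var2,size=imix -m 100% -p 0 1
```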
### FreeBSD 14: Kernel Bridge

The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD `igb(4)` driver),
a dual-port Intel X552 (10Gbps SFP+, using the `ix(4)` driver), and a dual-port Intel XXV710
(25Gbps SFP28, using the `ixl(4)` driver). I decide to live it up a little, and choose the 25G ports
for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU
will struggle a little bit at very high packet rates. No pain, no gain, _amirite_?

I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC
kernel that has support for the DPDK modules I'll need later. For my first loadtest, I create a
kernel based bridge as follows, just tying the two 25G interfaces together:

```
[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7

[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=0
        ether 58:9c:fc:10:6c:2e
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 4 priority 128 path cost 800
        member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 3 priority 128 path cost 800
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
```
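
This bridge evaporates on reboot, by the way. I didn't need persistence for these loadtests, but
for completeness, the stock `rc.conf(5)` way to recreate it at boot would be something like this
sketch:

```
# /etc/rc.conf snippet -- recreate bridge0 and its members at boot
cloned_interfaces="bridge0"
ifconfig_bridge0="addm ixl0 addm ixl1 up"
ifconfig_ixl0="up"
ifconfig_ixl1="up"
```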

One thing that I quickly realize is that FreeBSD, when using hyperthreading, does have 8 threads
available, but only 4 of them participate in forwarding. When I put the machine under load, I see a
curious 399% spent in _kernel_ while I see 402% in _idle_:

{{< image src="/assets/freebsd-vpp/top-kernel-bridge.png" alt="FreeBSD top" >}}

When I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU
participates (100% in _kernel_ and 700% in _idle_), and when I perform a single-flow bidirectional
loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in _kernel_
and 600% in _idle_).

While the math checks out, the performance is a little bit less impressive:

| Type              | Uni/BiDir      | Packets/Sec | L2 Bits/Sec | Line Rate |
| ----------------- | -------------- | ----------- | ----------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.02Mpps    | 24.77Gbps   | 99%       |
| vm=var2,size=imix | Unidirectional | 3.48Mpps    | 10.23Gbps   | 43%       |
| vm=var2,size=64   | Unidirectional | 3.61Mpps    | 2.43Gbps    | 9.7%      |
| size=64           | Unidirectional | 1.22Mpps    | 0.82Gbps    | 3.2%      |
| vm=var2,size=1514 | Bidirectional  | 3.77Mpps    | 46.31Gbps   | 93%       |
| vm=var2,size=imix | Bidirectional  | 3.81Mpps    | 11.22Gbps   | 24%       |
| vm=var2,size=64   | Bidirectional  | 4.02Mpps    | 2.69Gbps    | 5.4%      |
| size=64           | Bidirectional  | 2.29Mpps    | 1.54Gbps    | 3.1%      |

***Conclusion***: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU
thread, and I can use only four of them. FreeBSD is happy to forward big packets, and I can
reasonably reach 2x25Gbps, but once I start ramping up the packets/sec by lowering the packet size,
things deteriorate very quickly.

### FreeBSD 14: netmap Bridge

Tom pointed out a tool in the source tree, called the _netmap bridge_, originally written by Luigi
Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub
repository [[ref](https://github.com/luigirizzo/netmap/blob/master/apps/bridge/bridge.c)].

What is _netmap_ anyway? It's a framework for extremely fast and efficient packet I/O for userspace
and kernel clients, and for Virtual Machines. It runs on FreeBSD, Linux and some versions of
Windows. As an aside, my buddy Pavel from FastNetMon pointed out a blogpost from 2015 in which
Cloudflare folks described a way to do DDoS mitigation on Linux, using traffic classification to
program the network cards to move certain offensive traffic to a dedicated hardware queue, and
service that queue from a _netmap_ client. If you're curious (I certainly was!), you might take a
look at that cool write-up
[[here](https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap)].

I compile the code and put it to work, and the man-page tells me that I need to fiddle with the
interfaces a bit:

* they need to be set to _promiscuous_, which makes sense as they have to receive ethernet frames
  sent to MAC addresses other than their own;
* any hardware offloading needs to be turned off, notably `-rxcsum -txcsum -tso4 -tso6 -lro`;
* my user needs write permission on `/dev/netmap` to bind the interfaces from userspace.

```
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
```

{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

I start my first loadtest, which fails almost immediately. It's an interesting behavior pattern that
I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a
remarkably straightforward program, I restart the bridge utility, and traffic passes again, but only
for a little while. Whoops!

I took a [[screencast](/assets/freebsd-vpp/netmap_bridge.cast)] in case any kind soul on freebsd-net
wants to take a closer look at this:

{{< image src="/assets/freebsd-vpp/netmap_bridge.gif" alt="FreeBSD netmap Bridge" >}}

A bit of trial and error leads me to conclude that if I send **a lot** of traffic (like 10Mpps),
forwarding is fine; but if I send **a little** traffic (like 1kpps), at some point forwarding stops
altogether. So while it's not great, this does allow me to measure the total throughput simply by
sending a lot of traffic, say 30Mpps, and seeing what amount comes out the other side.

Here I go, and I'm having fun:

| Type              | Uni/BiDir      | Packets/Sec | L2 Bits/Sec | Line Rate |
| ----------------- | -------------- | ----------- | ----------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.04Mpps    | 24.72Gbps   | 100%      |
| vm=var2,size=imix | Unidirectional | 8.16Mpps    | 23.76Gbps   | 100%      |
| vm=var2,size=64   | Unidirectional | 10.83Mpps   | 5.55Gbps    | 29%       |
| size=64           | Unidirectional | 11.42Mpps   | 5.83Gbps    | 31%       |
| vm=var2,size=1514 | Bidirectional  | 3.91Mpps    | 47.27Gbps   | 96%       |
| vm=var2,size=imix | Bidirectional  | 11.31Mpps   | 32.74Gbps   | 77%       |
| vm=var2,size=64   | Bidirectional  | 11.39Mpps   | 5.83Gbps    | 15%       |
| size=64           | Bidirectional  | 11.57Mpps   | 5.93Gbps    | 16%       |

***Conclusion***: FreeBSD's _netmap_ implementation is also bound by packets/sec, and in this
setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that
single flow or multiple flows doesn't seem to matter that much; in fact, the bidirectional 64b single
flow loadtest was the most favorable at 11.57Mpps, which is _an order of magnitude_ better than using
just the kernel (which clocked in at 1.2Mpps).

### FreeBSD 14: VPP with netmap

It's good to have a baseline for how the FreeBSD kernel itself performs on this machine. But of
course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch
that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my
[[first article]({{< ref "2024-02-10-vpp-freebsd-1" >}})], so I'll go straight to the
configuration parts:

```
DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0

DBGvpp# show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
local0                            0     down          0/0/0/0
netmap-ixl0                       1      up          9000/0/0/0     rx packets                 25622
                                                                    rx bytes                 1537320
                                                                    tx packets                 25437
                                                                    tx bytes                 1526220
netmap-ixl1                       2      up          9000/0/0/0     rx packets                 25437
                                                                    rx bytes                 1526220
                                                                    tx packets                 25622
                                                                    tx bytes                 1537320
```

At this point I can pretty much rule out that the _netmap_ `bridge.c` is the issue, because a
few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester
receives no more packets, even though T-Rex is still sending. However, about a minute later,
I can _also_ see the RX **and** TX counters continue to increase in the VPP dataplane:

```
DBGvpp# show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
local0                            0     down          0/0/0/0
netmap-ixl0                       1      up          9000/0/0/0     rx packets                515843
                                                                    rx bytes                30950580
                                                                    tx packets                515657
                                                                    tx bytes                30939420
netmap-ixl1                       2      up          9000/0/0/0     rx packets                515657
                                                                    rx bytes                30939420
                                                                    tx packets                515843
                                                                    tx bytes                30950580
```

.. and I can see that every packet that VPP received is accounted for: interface `ixl0` has received
515843 packets, and `ixl1` claims to have transmitted _exactly_ that number of packets. So I think
perhaps they are getting lost somewhere on egress, between the kernel and the Intel XXV710 network
card.

However, contrary to the previous case, I cannot sustain any reasonable amount of traffic; be it
1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after
introducing the load. Restarting VPP makes it forward traffic again for a few seconds, only to end
up in the same upset state. I don't learn much.

***Conclusion***: This setup with VPP using _netmap_ does not yield results, for the moment. I have a
suspicion that whatever is tripping up the _netmap_ bridge in the previous test is likely also the
culprit for this one.

### FreeBSD 14: VPP with DPDK

But not all is lost - I have one test left, and judging by what I learned last week when bringing up
the first test environment, this one is going to be a fair bit better. In my previous loadtests, the
network interfaces were bound to their usual kernel drivers (`ixl(4)` in the case of the Intel XXV710
interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a specific DPDK
driver called `nic_uio(4)`, which stands for _Network Interface Card Userspace Input/Output_:

```
[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
```
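
A quick way to double check that the rebind took is to ask `pciconf(8)` which driver claims the
devices. After the reboot, I'd expect the two 25G ports at bus 6 to show up under `nic_uio` rather
than `ixl` (a sketch, assuming the same PCI enumeration as before):

```
[pim@france ~]$ pciconf -l | grep -E '^(ixl|nic_uio)'
```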

After I reboot, the network interfaces are gone from the output of `ifconfig(8)`, which is good. I
start up VPP with a minimal config file [[ref](/assets/freebsd-vpp/startup.conf)], which defines
three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
there would be one more TX queue. The explanation is that in VPP, there is one (1) _main_ thread and
zero or more _worker_ threads. If the _main_ thread wants to send traffic (for example, in a plugin
like _LLDP_ which sends periodic announcements), it is most efficient to use a transmit queue
specific to that _main_ thread. Any return traffic will be picked up by the _DPDK Process_ on worker
threads (as _main_ does not have one of these). That's why the general rule is num(TX) = num(RX)+1.
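
The actual `startup.conf` is linked above; its load-bearing stanzas look roughly like this sketch,
assuming the PCI addresses that follow from the `hw.nic_uio.bdfs` binding earlier:

```
cpu {
    # One main thread on core 0, three workers on cores 1-3
    main-core 0
    corelist-workers 1-3
}
dpdk {
    # The two 25G ports that nic_uio(4) claimed at boot
    dev 0000:06:00.0 { num-rx-queues 3 num-tx-queues 4 }
    dev 0000:06:00.1 { num-rx-queues 3 num-tx-queues 4 }
}
```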

```
[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release

vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
TwentyFiveGigabitEthernet6/0/0    1      up          9000/0/0/0     rx packets           11615035382
                                                                    rx bytes           1785998048960
                                                                    tx packets             700076496
                                                                    tx bytes            161043604594
TwentyFiveGigabitEthernet6/0/1    2      up          9000/0/0/0     rx packets             700076542
                                                                    rx bytes            161043674054
                                                                    tx packets           11615035440
                                                                    tx bytes           1785998136540
local0                            0     down          0/0/0/0
```

And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great
relief, sending either 1kpps or 1Mpps "just works". I can run my loadtest as per normal, first with
1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
course, both unidirectionally and bidirectionally.

I take a look at the system load while the loadtests are running:

{{< image src="/assets/freebsd-vpp/top-vpp-dpdk.png" alt="FreeBSD top" >}}

It is fully expected that the VPP process is spinning at 300% +epsilon of CPU time. This is because
it has started three _worker_ threads, and these are executing the DPDK _Poll Mode Driver_, which is
essentially a tight loop that asks the network cards for work, and if there are any packets
arriving, executes on that work. As such, each _worker_ thread is always burning 100% of its
assigned CPU.

That said, I can take a look at finer grained statistics in the dataplane itself:

```
vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
ip4-full-reassembly-expire-wal  any wait            0             0           18   2.39e3         0.00
ip6-full-reassembly-expire-wal  any wait            0             0           18   3.08e3         0.00
unix-cli-process-0               active             0             0            9   7.62e4         0.00
unix-epoll-input                polling         13066             0            0   1.50e5         0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
  vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TwentyFiveGigabitEthernet6/0/1   active        399663       5047800            0   2.20e1        12.63
TwentyFiveGigabitEthernet6/0/1   active        399663       5047800            0   9.54e1        12.63
dpdk-input                      polling       1531252       5047800            0   1.45e2         3.29
ethernet-input                   active        399663       5047800            0   3.97e1        12.63
l2-input                         active        399663       5047800            0   2.93e1        12.63
l2-output                        active        399663       5047800            0   2.53e1        12.63
unix-epoll-input                polling          1494             0            0   3.09e2         0.00

(et cetera)
```

I showed only one _worker_ thread's output, but there are actually three _worker_ threads, and they
are all doing similar work, because they each pick up roughly a third of the traffic, assigned to
the three RX queues in the network card.

While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the _main_ thread)
is doing essentially ~nothing. It is polling a set of unix sockets in the node called
`unix-epoll-input`, but other than that, _main_ doesn't have much on its plate. Thread 1, however, is
a _worker_ thread, and I can see that it is busy doing work:

* `dpdk-input`: it's polling the NIC for work; it has been called 1.53M times, and in total it has
  handled just over 5.04M _vectors_ (which are packets). From this I can derive that each time the
  _Poll Mode Driver_ finds work, on average there are 3.29 _vectors_ (packets), and each packet
  takes about 145 CPU clocks.
* `ethernet-input`: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as
  I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it
  should handle the packets in the L2 forwarding path.
* `l2-input` is called with the (list of N) ethernet frames, which all get cross connected to the
  output interface, in this case Tf6/0/1.
* `l2-output` prepares the ethernet frames for output on their egress interface.
* `TwentyFiveGigabitEthernet6/0/1-output` (**Note**: the name is truncated): if this had been L3
  traffic, this would be the place where the destination MAC address is inserted into the
  ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames
  through to the final egress node in DPDK.
* `TwentyFiveGigabitEthernet6/0/1-tx` (**Note**: the name is truncated) hands them to the DPDK
  driver for marshalling on the wire.

Halfway through, I see that there's an issue with the distribution of ingress traffic over the
three workers; maybe you can spot it too:

```
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
  vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active       6689553     206899956            0   1.34e1        30.93
TwentyFiveGigabitEthernet6/0/0   active       6689553     206899956            0   1.37e2        30.93
TwentyFiveGigabitEthernet6/0/1   active       6688572     206902836            0   1.45e1        30.93
TwentyFiveGigabitEthernet6/0/1   active       6688572     206902836            0   1.34e2        30.93
dpdk-input                      polling       7128012     413802792            0   8.77e1        58.05
ethernet-input                   active      13378125     413802792            0   2.77e1        30.93
l2-input                         active       6809002     413802792            0   1.81e1        60.77
l2-output                        active       6809002     413802792            0   1.68e1        60.77
unix-epoll-input                polling          6954             0            0   6.61e2         0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active        456112     116764672            0   1.27e1       256.00
TwentyFiveGigabitEthernet6/0/0   active        456112     116764672            0   2.64e2       256.00
TwentyFiveGigabitEthernet6/0/1   active        456112     116764672            0   1.39e1       256.00
TwentyFiveGigabitEthernet6/0/1   active        456112     116764672            0   2.74e2       256.00
dpdk-input                      polling        456112     233529344            0   1.41e2       512.00
ethernet-input                   active        912224     233529344            0   5.71e1       256.00
l2-input                         active        912224     233529344            0   3.66e1       256.00
l2-output                        active        912224     233529344            0   1.70e1       256.00
unix-epoll-input                polling           445             0            0   9.59e2         0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active        456113     116764928            0   8.94e0       256.00
TwentyFiveGigabitEthernet6/0/0   active        456113     116764928            0   2.81e2       256.00
TwentyFiveGigabitEthernet6/0/1   active        456113     116764928            0   9.54e0       256.00
TwentyFiveGigabitEthernet6/0/1   active        456113     116764928            0   2.72e2       256.00
dpdk-input                      polling        456113     233529856            0   1.61e2       512.00
ethernet-input                   active        912226     233529856            0   4.50e1       256.00
l2-input                         active        912226     233529856            0   2.93e1       256.00
l2-output                        active        912226     233529856            0   1.23e1       256.00
unix-epoll-input                polling           445             0            0   1.03e3         0.00
```

Thread 1 (`vpp_wk_0`) is handling 7.29Mpps and is moderately loaded, while threads 2 and 3 are each
handling 4.11Mpps and are completely pegged. That said, the relative number of CPU clocks they
spend per packet is reasonably similar, but the numbers don't quite add up:

* Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this
  number by adding up all of the values in the _Clocks_ column, except for the `unix-epoll-input`
  node. But that's somewhat strange, because this Xeon D-1518 clocks at 2.2GHz -- and yet 7.29M *
  449 is 3.27GHz. My experience (in Linux) is that these numbers usually line up quite well.
* Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of
  makes sense, as the cycles/packet is roughly double that of thread 1, and the packets/sec is
  roughly half ... but the total of 4.12M * 816 is 3.36GHz.
* I see similar values for thread 3: 4.12Mpps at 819 CPU cycles per packet, which amounts to VPP
  self-reporting using 3.37GHz worth of cycles on this thread.

When I look at the thread-to-CPU placement, I get another surprise:

```
vpp# show threads
ID     Name                Type        LWP     Sched Policy (Priority)  lcore  Core   Socket State
0      vpp_main                        100346  (nil) (n/a)              0      42949674294967
1      vpp_wk_0            workers     100473  (nil) (n/a)              1      42949674294967
2      vpp_wk_1            workers     100474  (nil) (n/a)              2      42949674294967
3      vpp_wk_2            workers     100475  (nil) (n/a)              3      42949674294967

vpp# show cpu
Model name:               Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
                          rdseed aes invariant_tsc
Base frequency:           2.19 GHz
```

The numbers in `show threads` are all messed up, and I don't quite know what to make of them yet. I
think the perhaps overly Linux-specific implementation of the thread pool management is throwing
FreeBSD off a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom
or the freebsd-net mailing list, who will know a fair bit more about this type of stuff on FreeBSD
than I do.

Anyway, functionally: this works. Performance wise: I have some questions :-) I let all eight
loadtests complete and, without further ado, here are the results:

| Type              | Uni/BiDir      | Packets/Sec | L2 Bits/Sec | Line Rate |
| ----------------- | -------------- | ----------- | ----------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.01Mpps    | 24.45Gbps   | 99%       |
| vm=var2,size=imix | Unidirectional | 8.07Mpps    | 23.42Gbps   | 99%       |
| vm=var2,size=64   | Unidirectional | 23.93Mpps   | 12.25Gbps   | 64%       |
| size=64           | Unidirectional | 12.80Mpps   | 6.56Gbps    | 34%       |
| vm=var2,size=1514 | Bidirectional  | 3.91Mpps    | 47.35Gbps   | 86%       |
| vm=var2,size=imix | Bidirectional  | 13.38Mpps   | 38.81Gbps   | 82%       |
| vm=var2,size=64   | Bidirectional  | 15.56Mpps   | 7.97Gbps    | 21%       |
| size=64           | Bidirectional  | 20.96Mpps   | 10.73Gbps   | 28%       |

***Conclusion***: I have to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only
being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow
loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would expect three CPU
threads to perform at 38.4Mpps or thereabouts, yet I'm seeing only 23.9Mpps and some unexplained
variance in per-thread performance.

## Results

I learned a lot! Some highlights:

1. The _netmap_ implementation is not playing ball for the moment, as forwarding stops consistently,
   in both the `bridge.c` utility as well as the VPP plugin.
1. It is clear, though, that _netmap_ is a fair bit faster (11.4Mpps) than _kernel forwarding_, which
   came in at roughly 1.2Mpps per CPU thread. What's a bit troubling is that _netmap_ doesn't seem
   to work very well in VPP -- traffic forwarding also stops here.
1. DPDK performs quite well on FreeBSD: I manage to see a throughput of 20.96Mpps, which is almost
   twice the throughput of _netmap_, which is cool, but I can't quite explain the stark variance
   in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
   Perhaps an equivalent of `isolcpus` in the Linux kernel would help? See the sketch below for one
   avenue I plan to explore.
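
The closest FreeBSD analog to `isolcpus` that I'm aware of is `cpuset(1)`, which restricts which
cores a process and its threads may run on. A hypothetical, untested sketch of what I might try
next (the vpp binary path and config file location are illustrative):

```
## Keep VPP's main and worker threads on cores 0-3, away from cores 4-7
[pim@france ~/src/vpp]$ sudo cpuset -l 0-3 \
    ./build-root/install-vpp-native/vpp/bin/vpp -c /home/pim/src/startup.conf
```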

For the curious, I've bundled up a few files that describe the machine and its setup:
[[dmesg](/assets/freebsd-vpp/france-dmesg.txt)]
[[pciconf](/assets/freebsd-vpp/france-pciconf.txt)]
[[loader.conf](/assets/freebsd-vpp/france-loader.conf)]
[[VPP startup.conf](/assets/freebsd-vpp/france-startup.conf)]