---
date: "2024-02-17T12:17:54Z"
title: VPP on FreeBSD - Part 2
aliases:
- /s/articles/2024/02/17/vpp-freebsd-2.html
---

# About this series

{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}}

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Over the years, folks have asked me regularly "What about BSD?" and to
my surprise, late last year I read an announcement from the _FreeBSD Foundation_
[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back
over 2023 and forward to 2024:

> ***Porting the Vector Packet Processor to FreeBSD***
>
> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
> that provides fast packet processing suitable for software-defined networking and network function
> virtualization applications. VPP aims to optimize packet processing through vectorized operations
> and parallelism, making it well-suited for high-speed networking applications. In November of this
> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network
> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
> experience, improving hardware support on arm64 platforms, and adding support for low power idle
> on Intel and arm64 hardware.

In my first [[article]({{< ref "2024-02-10-vpp-freebsd-1" >}})], I wrote a sort of a _hello world_
by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
performance numbers.

## Comparing implementations

FreeBSD has an extensive network stack: regular _kernel_ based functionality such as routing,
filtering and bridging; a faster _netmap_ based datapath, including some userspace utilities like a
_netmap_ bridge; and of course completely userspace based dataplanes, such as the VPP project that
I'm working on here. Last week, I learned that VPP has a _netmap_ driver, and from previous travels
I am already quite familiar with its _DPDK_ based forwarding. I decide to do a
baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
[[article]({{< ref "2024-02-10-vpp-freebsd-1" >}})] for details on the setup.

The loadtests will use a common set of configurations, based on Cisco T-Rex's default benchmark
profile called `bench.py`:
1.   **var2-1514b**: Large Packets, multiple flows with modulating source and destination IPv4
     addresses, often called an 'iperf test', with packets of 1514 bytes.
1.   **var2-imix**: Mixed Packets, multiple flows, often called an 'imix test', which includes a
     bunch of 64b, 390b and 1514b packets.
1.   **var2-64b**: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive
     queues and kernel or application threads.
1.   **64b**: Small Packets, but now single flow, often called 'linerate test', with a packet size
     of 64 bytes, limiting to one receive queue.

Each of these four loadtests can be run either unidirectionally (port0 -> port1) or bidirectionally
(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle
on and get underway.
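
For completeness: I drive these profiles from the T-Rex interactive console with something along
these lines, where `-p` selects the port(s), `-m` sets the rate, and `-t` passes the profile
tunables. Treat this as illustrative rather than a copy-paste recipe, as the exact flags can vary a
bit between T-Rex versions:

```
trex>start -f stl/bench.py -p 0 -m 30mpps -t vm=var2,size=64
trex>start -f stl/bench.py -p 0 1 -m 30mpps -t vm=var2,size=64
```

The first form is the unidirectional (port0 -> port1) variant; the second starts the same profile on
both ports for the bidirectional case.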

### FreeBSD 14: Kernel Bridge

The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD `igb(4)` driver),
a dual-port Intel X522 (10Gbps SFP+, using the `ix(4)` driver), and a dual-port Intel XXV710
(25Gbps SFP28, using the `ixl(4)` driver). I decide to live it up a little, and choose the 25G ports
for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU
will struggle a little bit at very high packet rates. No pain, no gain, _amirite_?

I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC
kernel that has support for the DPDK modules I'll need later. For my first loadtest, I create a
kernel based bridge as follows, just tying the two 25G interfaces together:

```
[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024     root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7

[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=0
	ether 58:9c:fc:10:6c:2e
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 4 priority 128 path cost 800
	member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 3 priority 128 path cost 800
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>

```

One thing I quickly realize is that FreeBSD, with hyperthreading enabled, has 8 CPU threads
available, but only 4 of them participate in forwarding. When I put the machine under load, I see a
curious 399% spent in _kernel_ while 402% is spent in _idle_:

{{< image src="/assets/freebsd-vpp/top-kernel-bridge.png" alt="FreeBSD top" >}}

When I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU
participates (100% in _kernel_ and 700% in _idle_), and when I perform a single-flow bidirectional
loadtest, my expectations are confirmed again: two CPU threads do the work (200% in _kernel_
and 600% in _idle_).

While the math checks out, the performance is a little bit less impressive:

| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.02Mpps | 24.77Gbps | 99% |
| vm=var2,size=imix | Unidirectional | 3.48Mpps | 10.23Gbps | 43% |
| vm=var2,size=64 | Unidirectional | 3.61Mpps | 2.43Gbps | 9.7% |
| size=64 | Unidirectional | 1.22Mpps | 0.82Gbps | 3.2% |
| vm=var2,size=1514 | Bidirectional | 3.77Mpps | 46.31Gbps | 93% |
| vm=var2,size=imix | Bidirectional | 3.81Mpps | 11.22Gbps | 24% |
| vm=var2,size=64 | Bidirectional | 4.02Mpps | 2.69Gbps | 5.4% |
| size=64 | Bidirectional | 2.29Mpps | 1.54Gbps | 3.1% |

***Conclusion***: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU
thread, and only four of them participate in forwarding. FreeBSD is happy to forward big packets,
and I can reasonably reach 2x25Gbps, but once I start ramping up the packets/sec by lowering the
packet size, things deteriorate very quickly.

### FreeBSD 14: netmap Bridge

Tom pointed out a tool in the source tree called the _netmap bridge_, originally written by Luigi
Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub
repository [[ref](https://github.com/luigirizzo/netmap/blob/master/apps/bridge/bridge.c)].

What is _netmap_ anyway? It's a framework for extremely fast and efficient packet I/O for userspace
and kernel clients, and for Virtual Machines. It runs on FreeBSD, Linux and some versions of
Windows. As an aside, my buddy Pavel from FastNetMon pointed out a blogpost from 2015 in which
Cloudflare folks described a way to do DDoS mitigation on Linux: they used traffic classification to
steer certain offensive traffic to a dedicated hardware queue on the network card, and then serviced
that queue from a _netmap_ client. If you're curious (I certainly was!), you might take a look at
that cool write-up
[[here](https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap)].

I compile the code and put it to work, and the man-page tells me that I need to fiddle with the
interfaces a bit:

*    they need to be set to _promiscuous_ mode, which makes sense as they have to receive ethernet
     frames sent to MAC addresses other than their own
*    any hardware offloading needs to be turned off, notably `-rxcsum -txcsum -tso4 -tso6 -lro`
*    my user needs write permission to `/dev/netmap` to bind the interfaces from userspace.

```
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
```

{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

I start my first loadtest, which fails almost immediately. It's an interesting behavior pattern that
I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a
remarkably straightforward program, I restart the bridge utility, and traffic passes again, but only
for a little while. Whoops!

I took a [[screencast](/assets/freebsd-vpp/netmap_bridge.cast)] in case any kind soul on freebsd-net
wants to take a closer look at this:

{{< image src="/assets/freebsd-vpp/netmap_bridge.gif" alt="FreeBSD netmap Bridge" >}}

After a bit of trial and error, I conclude that if I send **a lot** of traffic (like 10Mpps),
forwarding is fine; but if I send **a little** traffic (like 1kpps), at some point forwarding stops
altogether. So while it's not great, this does allow me to measure the total throughput just by
sending a lot of traffic, say 30Mpps, and seeing how much comes out the other side.

Here I go, and I'm having fun:

| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.04Mpps | 24.72Gbps | 100% |
| vm=var2,size=imix | Unidirectional | 8.16Mpps | 23.76Gbps | 100% |
| vm=var2,size=64 | Unidirectional | 10.83Mpps | 5.55Gbps | 29% |
| size=64 | Unidirectional | 11.42Mpps | 5.83Gbps | 31% |
| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.27Gbps | 96% |
| vm=var2,size=imix | Bidirectional | 11.31Mpps | 32.74Gbps | 77% |
| vm=var2,size=64 | Bidirectional | 11.39Mpps | 5.83Gbps | 15% |
| size=64 | Bidirectional | 11.57Mpps | 5.93Gbps | 16% |

***Conclusion***: FreeBSD's _netmap_ implementation is also bound by packets/sec, and in this
setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that
single flow or multiple flows doesn't seem to matter that much; in fact, the bidirectional 64b
single-flow loadtest was the most favorable at 11.57Mpps, which is _an order of magnitude_ better
than using just the kernel (which clocked in at 1.2Mpps).

### FreeBSD 14: VPP with netmap

It's good to have a baseline for how the FreeBSD kernel itself performs on this machine. But of
course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch
that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my
[[first article]({{< ref "2024-02-10-vpp-freebsd-1" >}})] so I'll just go straight to the
configuration parts:

```
DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0

DBGvpp# show int
    Name       Idx    State  MTU (L3/IP4/IP6/MPLS)   Counter      Count
local0         0     down          0/0/0/0
netmap-ixl0    1      up          9000/0/0/0     rx packets       25622
                                                 rx bytes       1537320
                                                 tx packets       25437
                                                 tx bytes       1526220
netmap-ixl1    2      up          9000/0/0/0     rx packets       25437
                                                 rx bytes       1526220
                                                 tx packets       25622
                                                 tx bytes       1537320
```

At this point I can pretty much rule out that the _netmap_ `bridge.c` utility is the issue, because
the same thing happens here: a few seconds after introducing 10Kpps of traffic and seeing it
successfully pass, the loadtester receives no more packets, even though T-Rex is still sending.
However, about a minute later I can see that the RX **and** TX counters have continued to increase
in the VPP dataplane:

```
DBGvpp# show int
    Name       Idx    State  MTU (L3/IP4/IP6/MPLS)   Counter      Count
local0         0     down          0/0/0/0
netmap-ixl0    1      up          9000/0/0/0     rx packets      515843
                                                 rx bytes      30950580
                                                 tx packets      515657
                                                 tx bytes      30939420
netmap-ixl1    2      up          9000/0/0/0     rx packets      515657
                                                 rx bytes      30939420
                                                 tx packets      515843
                                                 tx bytes      30950580
```

.. and I can see that every packet that VPP received is accounted for: interface `ixl0` has received
515843 packets, and `ixl1` claims to have transmitted _exactly_ that number of packets. So I think
perhaps the packets are getting lost somewhere on egress, between the kernel and the Intel XXV710
network card.

However, unlike the previous case, I cannot sustain any reasonable amount of traffic: be it
1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after
introducing the load. Restarting VPP makes it forward traffic again for a few seconds, only to end
up in the same upset state. I don't learn much.

***Conclusion***: This setup with VPP using _netmap_ does not yield results, for the moment. I
suspect that whatever is causing the _netmap_ bridge to stall in the previous test is also the
culprit for this test.

### FreeBSD 14: VPP with DPDK

But not all is lost - I have one test left, and judging by what I learned last week when bringing up
the first test environment, this one is going to be a fair bit better. In my previous loadtests, the
network interfaces were bound to their usual kernel driver (`ixl(4)` in the case of the Intel XXV710
interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a dedicated DPDK
driver called `nic_uio(4)`, which stands for _Network Interface Card Userspace Input/Output_:

```
[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
```
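
Once the machine comes back up, a quick sanity check is to confirm that the module loaded at boot
and that it, rather than `ixl(4)`, claimed the two devices at 6:0:0 and 6:0:1 (illustrative, output
omitted):

```
[pim@france ~]$ kldstat | grep nic_uio
[pim@france ~]$ pciconf -l | grep nic_uio
```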

After I reboot, the network interfaces are gone from the output of `ifconfig(8)`, which is good. I
start up VPP with a minimal config file [[ref](/assets/freebsd-vpp/startup.conf)], which defines
three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
there would be one more TX queue. The explanation is that in VPP, there is one (1) _main_ thread and
zero or more _worker_ threads. If the _main_ thread wants to send traffic (for example, in a plugin
like _LLDP_ which sends periodic announcements), it would be most efficient to use a transmit queue
specific to that _main_ thread. Any return traffic will be picked up by the _DPDK Process_ on worker
threads (as _main_ does not have one of these). That's why the general rule is num(TX) = num(RX)+1.
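
To make that concrete, here is a minimal sketch of what such a `startup.conf` can look like; the
actual file I used is linked above and may differ in its details. The PCI addresses correspond to
the devices I bound to `nic_uio(4)` at 6:0:0 and 6:0:1:

```
unix {
  interactive                      # stay in the foreground with a CLI prompt
  cli-listen /run/vpp/cli.sock     # and also expose the CLI on a socket for vppctl
}
cpu {
  main-core 0                      # one main thread on core 0 ...
  corelist-workers 1-3             # ... and three worker threads on cores 1-3
}
dpdk {
  dev 0000:06:00.0 { num-rx-queues 3 num-tx-queues 4 }
  dev 0000:06:00.1 { num-rx-queues 3 num-tx-queues 4 }
}
```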

```
[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release

vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
TwentyFiveGigabitEthernet6/0/0    1      up          9000/0/0/0     rx packets           11615035382
                                                                    rx bytes           1785998048960
                                                                    tx packets             700076496
                                                                    tx bytes            161043604594
TwentyFiveGigabitEthernet6/0/1    2      up          9000/0/0/0     rx packets             700076542
                                                                    rx bytes            161043674054
                                                                    tx packets           11615035440
                                                                    tx bytes           1785998136540
local0                            0     down          0/0/0/0

```

And with that, the dataplane springs to life and starts forwarding (lots of) packets. To my great
relief, sending either 1kpps or 1Mpps "just works". I can run my loadtests as usual: first with
1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
course, both unidirectionally and bidirectionally.

I take a look at the system load while the loadtests are running:

{{< image src="/assets/freebsd-vpp/top-vpp-dpdk.png" alt="FreeBSD top" >}}

It is fully expected that the VPP process is spinning at 300% +epsilon of CPU time. This is because it
has started three _worker_ threads, and these are executing the DPDK _Poll Mode Driver_, which is
essentially a tight loop that asks the network cards for work, and if there are any packets
arriving, executes on that work. As such, each _worker_ thread is always burning 100% of its
assigned CPU.

That said, I can take a look at finer grained statistics in the dataplane itself:


```
vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls   Vectors   Suspends   Clocks  Vectors/Call
ip4-full-reassembly-expire-wal  any wait            0         0         18   2.39e3          0.00
ip6-full-reassembly-expire-wal  any wait            0         0         18   3.08e3          0.00
unix-cli-process-0               active             0         0          9   7.62e4          0.00
unix-epoll-input                 polling        13066         0          0   1.50e5          0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
  vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls   Vectors   Suspends   Clocks  Vectors/Call
TwentyFiveGigabitEthernet6/0/1   active        399663   5047800          0   2.20e1         12.63
TwentyFiveGigabitEthernet6/0/1   active        399663   5047800          0   9.54e1         12.63
dpdk-input                       polling      1531252   5047800          0   1.45e2          3.29
ethernet-input                   active        399663   5047800          0   3.97e1         12.63
l2-input                         active        399663   5047800          0   2.93e1         12.63
l2-output                        active        399663   5047800          0   2.53e1         12.63
unix-epoll-input                 polling         1494         0          0   3.09e2          0.00

(et cetera)
```
I showed only one _worker_ thread's output, but there are actually three _worker_ threads, and they
are all doing similar work: each one picks up roughly 33% of the traffic, as each worker is assigned
one of the three RX queues in the network card.
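
If you want to double-check which worker polls which receive queue, VPP can show the placement
directly. With 3 RX queues per interface and 3 workers, the expectation is one queue of each
interface per worker, along these lines (illustrative output, abbreviated):

```
vpp# show interface rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TwentyFiveGigabitEthernet6/0/0 queue 0 (polling)
    TwentyFiveGigabitEthernet6/0/1 queue 0 (polling)
Thread 2 (vpp_wk_1):
  node dpdk-input:
    TwentyFiveGigabitEthernet6/0/0 queue 1 (polling)
    TwentyFiveGigabitEthernet6/0/1 queue 1 (polling)
Thread 3 (vpp_wk_2):
  node dpdk-input:
    TwentyFiveGigabitEthernet6/0/0 queue 2 (polling)
    TwentyFiveGigabitEthernet6/0/1 queue 2 (polling)
```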

While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the _main_ thread)
is doing essentially ~nothing. It is polling a set of unix sockets in the node called
`unix-epoll-input`, but other than that, _main_ doesn't have much on its plate. Thread 1, however, is
a _worker_ thread, and I can see that it is busy doing work:

*   `dpdk-input`: it's polling the NIC for work; it has been called 1.53M times, and in total it has
    handled just over 5.04M _vectors_ (which are packets). So I can derive that each time the _Poll
    Mode Driver_ returns work, there are on average 3.29 _vectors_ (packets), and each packet takes
    about 145 CPU clocks.
*   `ethernet-input`: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as
    I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it
    should handle the packets in the L2 forwarding path.
*   `l2-input` is called with the (list of N) ethernet frames, which all get cross connected to the
    output interface, in this case Tf6/0/1.
*   `l2-output` prepares the ethernet frames for output into their egress interface.
*   `TwentyFiveGigabitEthernet6/0/1-output` (**Note**: the name is truncated) If this had been
    L3 traffic, this is where the destination MAC address would be inserted into the ethernet
    frame, but since this is an L2 cross connect, the node simply passes the ethernet frames
    through to the final egress node in DPDK.
*   `TwentyFiveGigabitEthernet6/0/1-tx` (**Note**: the name is truncated) hands them to the DPDK
    driver for marshalling on the wire.

Halfway through, I see that there's an issue with the distribution of ingress traffic over the
three workers, maybe you can spot it too:

```
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
  vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends   Clocks  Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active     6689553   206899956         0   1.34e1         30.93
TwentyFiveGigabitEthernet6/0/0   active     6689553   206899956         0   1.37e2         30.93
TwentyFiveGigabitEthernet6/0/1   active     6688572   206902836         0   1.45e1         30.93
TwentyFiveGigabitEthernet6/0/1   active     6688572   206902836         0   1.34e2         30.93
dpdk-input                       polling    7128012   413802792         0   8.77e1         58.05
ethernet-input                   active    13378125   413802792         0   2.77e1         30.93
l2-input                         active     6809002   413802792         0   1.81e1         60.77
l2-output                        active     6809002   413802792         0   1.68e1         60.77
unix-epoll-input                 polling       6954           0         0   6.61e2          0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends   Clocks  Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active      456112   116764672         0   1.27e1        256.00
TwentyFiveGigabitEthernet6/0/0   active      456112   116764672         0   2.64e2        256.00
TwentyFiveGigabitEthernet6/0/1   active      456112   116764672         0   1.39e1        256.00
TwentyFiveGigabitEthernet6/0/1   active      456112   116764672         0   2.74e2        256.00
dpdk-input                       polling     456112   233529344         0   1.41e2        512.00
ethernet-input                   active      912224   233529344         0   5.71e1        256.00
l2-input                         active      912224   233529344         0   3.66e1        256.00
l2-output                        active      912224   233529344         0   1.70e1        256.00
unix-epoll-input                 polling        445           0         0   9.59e2          0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends  Clocks  Vectors/Call
TwentyFiveGigabitEthernet6/0/0   active      456113   116764928         0  8.94e0        256.00
TwentyFiveGigabitEthernet6/0/0   active      456113   116764928         0  2.81e2        256.00
TwentyFiveGigabitEthernet6/0/1   active      456113   116764928         0  9.54e0        256.00
TwentyFiveGigabitEthernet6/0/1   active      456113   116764928         0  2.72e2        256.00
dpdk-input                       polling     456113   233529856         0  1.61e2        512.00
ethernet-input                   active      912226   233529856         0  4.50e1        256.00
l2-input                         active      912226   233529856         0  2.93e1        256.00
l2-output                        active      912226   233529856         0  1.23e1        256.00
unix-epoll-input                 polling        445           0         0  1.03e3          0.00
```

Thread 1 (`vpp_wk_0`) is handling 7.29Mpps and is moderately loaded, while threads 2 and 3 are each
handling 4.12Mpps and are completely pegged. That said, the CPU clocks they spend per packet roughly
balance out against the packet rates they handle, but the numbers don't quite add up:

*  Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this
   number by adding up all of the values in the _Clocks_ column, except for the `unix-epoll-input`
   node (the sum is worked out below). But that's somewhat strange, because this Xeon D-1518 clocks
   at 2.2GHz -- and yet 7.29M * 449 is 3.27GHz. My experience (in Linux) is that these numbers
   actually line up quite well.
*  Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of
   makes sense as the cycles/packet is roughly double that of thread 1, and the packet/sec is
   roughly half ... and the total of 4.12M * 816 is 3.36GHz.
*  I can see similar values for thread 3: 4.12Mpps and also 819 CPU cycles per packet, which
   amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
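
Spelled out for thread 1, using the per-node _Clocks_ values from the `show run` output above and
skipping `unix-epoll-input`:

```
  13.4 + 137 + 14.5 + 134 + 87.7 + 27.7 + 18.1 + 16.8  ~=  449 cycles/packet
  449 cycles/packet * 7.2982 Mpps                      ~=  3.27 GHz worth of cycles per second
```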

When I look at the thread to CPU placement, I get another surprise:

```
vpp# show threads
ID     Name                Type        LWP     Sched Policy (Priority)  lcore  Core   Socket State
0      vpp_main                        100346  (nil) (n/a)              0      42949674294967
1      vpp_wk_0            workers     100473  (nil) (n/a)              1      42949674294967
2      vpp_wk_1            workers     100474  (nil) (n/a)              2      42949674294967
3      vpp_wk_2            workers     100475  (nil) (n/a)              3      42949674294967

vpp# show cpu
Model name:               Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
                          rdseed aes invariant_tsc
Base frequency:           2.19 GHz
```

The numbers in `show threads` are all messed up, and I don't quite know what to make of them yet. I
suspect that the thread pool management code, which is perhaps written with Linux specifics in mind,
is throwing FreeBSD off a bit. Perhaps some profiling could be useful, so I make a note to discuss
this with Tom or the freebsd-net mailing list, who will know a fair bit more about this type of
stuff on FreeBSD than I do.

Anyway, functionally: this works. Performance-wise: I have some questions :-) I let all eight
loadtests complete and, without further ado, here are the results:

| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.01Mpps | 24.45Gbps | 99% |
| vm=var2,size=imix | Unidirectional | 8.07Mpps | 23.42Gbps | 99% |
| vm=var2,size=64 | Unidirectional | 23.93Mpps | 12.25Gbps | 64% |
| size=64 | Unidirectional | 12.80Mpps | 6.56Gbps | 34% |
| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.35Gbps | 86% |
| vm=var2,size=imix | Bidirectional | 13.38Mpps | 38.81Gbps | 82% |
| vm=var2,size=64 | Bidirectional | 15.56Mpps | 7.97Gbps | 21% |
| size=64 | Bidirectional | 20.96Mpps | 10.73Gbps | 28% |

***Conclusion***: I have to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only
being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow
loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would imagine that
three CPU threads would perform at 38.4Mpps or thereabouts, yet I'm seeing only 23.9Mpps and some
unexplained variance in per-thread performance.

## Results

I learned a lot! Some highlights:
1.  The _netmap_ implementation is not playing ball for the moment, as forwarding consistently stops,
    in both the `bridge.c` utility and the VPP plugin.
1.  It is clear, though, that _netmap_ is a fair bit faster (11.4Mpps) than _kernel forwarding_, which
    came in at roughly 1.2Mpps per CPU thread. What's a bit troubling is that _netmap_ doesn't seem to
    work very well in VPP -- traffic forwarding also stops here.
1.  DPDK performs quite well on FreeBSD: I manage to see a throughput of 20.96Mpps, which is almost
    twice the throughput of _netmap_. That's cool, but I can't quite explain the stark variance in
    throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
    Perhaps an equivalent of `isolcpus` in the Linux kernel would help?

For the curious, I've bundled up a few files that describe the machine and its setup:
[[dmesg](/assets/freebsd-vpp/france-dmesg.txt)]
[[pciconf](/assets/freebsd-vpp/france-pciconf.txt)]
[[loader.conf](/assets/freebsd-vpp/france-loader.conf)]
[[VPP startup.conf](/assets/freebsd-vpp/france-startup.conf)]