---
date: "2024-10-06T07:51:23Z"
title: 'VPP with sFlow - Part 2'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}

Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed
switched networks. `sFlow` gives complete visibility into the use of networks, enabling performance
optimization, accounting/billing for usage, and defense against security threats.

The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for a small
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
reduces the complexity of the code to be implemented in the forwarding path, while at the same time
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
_Collector_.

Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
complexity, and agent features. This article describes our journey.

## VPP: an sFlow plugin

There are three things Neil and I specifically want to take a look at:

1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
   interfaces.
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
   possible, targeting single digit CPU cycles per packet (a sketch of how agents typically achieve
   this follows the list).
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
   quickly as possible, targeting double digit CPU cycles per sample.
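
How can the unsampled path stay in the single digits? A common trick in sFlow agents is a randomized
countdown: each worker keeps a `skip` counter drawn from a distribution with mean N, and the
per-packet work is one decrement and one branch. A minimal sketch of the general technique (my
illustration, not necessarily this plugin's exact logic):

```
/* Countdown-based 1:N sampling (illustrative). The fast path is a single
 * decrement-and-test; the random reseed only runs once per sample.
 * Note: skip must be initialized with next_skip() before first use. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint32_t skip;        /* packets left until the next sample */
  uint32_t sampling_N;  /* e.g. 100 for 1:100 sampling */
} sampler_t;

/* A uniform skip in [1, 2N-1] has mean N, so long-run we sample 1-in-N. */
static inline uint32_t next_skip (uint32_t N)
{
  return 1 + (random () % (2 * N - 1));
}

static inline int take_sample (sampler_t *s)
{
  if (--s->skip > 0)    /* overwhelmingly common case */
    return 0;
  s->skip = next_skip (s->sampling_N);
  return 1;             /* this packet gets sampled */
}
```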

For all of this validation and loadtesting, I use a bare metal VPP machine which receives load from
a T-Rex loadtester on eight TenGig ports, configured as follows:

**1. RX Queue Placement**

It's important that the network card receiving the traffic gets serviced by a worker thread on the
same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I align each NIC with
the correct CPU, like so:

```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6

set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
```

**2. L3 IPv4/MPLS interfaces**

I take two pairs of interfaces, one on NUMA0 and the other on NUMA1, so that I can compare L3 IPv4
or MPLS running without `sFlow` (which I'll call the _baseline_ pair) against a pair running _with_
`sFlow` (which I'll call the _experiment_ pair).

```
comment { L3: IPv4 interfaces }
set int state TenGigabitEthernet3/0/0 up
set int state TenGigabitEthernet3/0/1 up
set int state TenGigabitEthernet130/0/0 up
set int state TenGigabitEthernet130/0/1 up
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
ip route add 16.0.0.0/24 via 100.64.0.0
ip route add 48.0.0.0/24 via 100.64.1.0
ip route add 16.0.2.0/24 via 100.64.4.0
ip route add 48.0.2.0/24 via 100.64.5.0
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to when a
packet has to be forwarded to 100.64.0.0 or 100.64.5.0, and it saves VPP from having to perform ARP
resolution.

The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:

```
comment { MPLS interfaces }
mpls table add 0
set interface mpls TenGigabitEthernet3/0/0 enable
set interface mpls TenGigabitEthernet3/0/1 enable
set interface mpls TenGigabitEthernet130/0/0 enable
set interface mpls TenGigabitEthernet130/0/1 enable
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```

**3. L2 CrossConnect interfaces**

Here, I use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can compare the performance
impact of enabling `sFlow`, and I can also assert that no regression occurs in the _baseline_ pair
when I enable a feature in the _experiment_ pair, which should really never happen.

```
comment { L2 xconnected interfaces }
set int state TenGigabitEthernet3/0/2 up
set int state TenGigabitEthernet3/0/3 up
set int state TenGigabitEthernet130/0/2 up
set int state TenGigabitEthernet130/0/3 up
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```

**4. T-Rex Configuration**

The Cisco T-Rex loadtester runs on another machine in the same rack. Physically, it has eight
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
just has VLANs with two ports each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
VLANs are used.

The configuration for T-Rex then becomes:

```
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac: 00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc
    - src_mac: 00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd

    - src_mac: 00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01
    - src_mac: 00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00

    - src_mac: 00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0
    - src_mac: 00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1

    - src_mac: 9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01
    - src_mac: 9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00
```

Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.

For L2XC, the MAC addresses don't matter. VPP will put the NIC in _promiscuous_ mode, which means
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
L2XC mode (the second and fourth pairs), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.

With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate,
which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!

{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}

The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in
_baseline_ mode and at a somewhat unfavorable 1:100 sampling rate.

Ready? Here I go!

### v1: Workers send RPC to main

**TL/DR:** _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
baseline_

The first iteration goes all the way back to a proof of concept from last year. It's described in
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
stellar:

* ☢ When slamming a single sFlow-enabled interface, _all interfaces_ regress. When sending 8Mpps
  of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only
  5.2Mpps get through. This is considered a mortal sin in VPP-land.
* ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad.
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10), completely
  destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.

Here's the bloodbath as seen from T-Rex:

{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
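
For concreteness, here's roughly what the v1 shape looks like: a minimal sketch using VPP's
`vl_api_rpc_call_main_thread()` (the API the plugin used at this stage), with made-up names and a
hypothetical `sflow_psample_send()`; VPP's `u8`/`u32` typedefs are assumed:

```
/* v1 shape (illustrative, not the plugin's exact code): ship each sample to
 * main as an RPC. VPP copies the payload and runs the callback in the main
 * thread, but enqueueing the RPC takes a spinlock all workers contend on. */
#include <vlibmemory/api.h>

typedef struct {
  u32 hw_if_index;   /* input interface of the sampled packet */
  u32 sampling_N;    /* configured sampling rate */
  u32 length;        /* bytes of captured header */
  u8 header[256];    /* first bytes of the sampled packet */
} sflow_sample_t;

static void
sflow_rpc_cb (sflow_sample_t *s)  /* runs in the main thread */
{
  sflow_psample_send (s);         /* hypothetical netlink writer */
}

static void
sflow_sample_from_worker (sflow_sample_t *s)
{
  vl_api_rpc_call_main_thread ((void *) sflow_rpc_cb, (u8 *) s, sizeof (*s));
}
```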

**Debrief**: When we talked through these issues, we drew the conclusion that it would be much
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks are
needed, and each worker thread will have its own producer queue.

Then, we can create a separate thread (or even a pool of threads), scheduled possibly on a different
CPU (or in main), that runs a loop iterating over all sFlow sample queues, consuming the samples and
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
many coming in.

### v2: Workers send PSAMPLE directly

**TL/DR:** _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_

But before we do that, we have a curiosity itch to scratch - what if we sent the sample directly
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
certain, though: it should be much faster than sending an RPC to the main thread.

After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
workers send directly, they invoke `sflow_send_sample_from_worker()` instead of sending an RPC
with `vl_api_rpc_call_main_thread()` as in the previous version.
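
To get a feel for what the worker now has to do inline, here's a minimal sketch of emitting one
`PSAMPLE_CMD_SAMPLE` message over generic netlink (my illustration, not the plugin's code: it
assumes the "psample" generic netlink family id was resolved beforehand, uses the attribute types
from `linux/psample.h`, and omits all error and bounds handling):

```
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <linux/psample.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Append one netlink attribute to the message under construction. */
static void nl_put_attr (struct nlmsghdr *nlh, int type, const void *data, int len)
{
  struct nlattr *nla = (struct nlattr *) ((char *) nlh + NLMSG_ALIGN (nlh->nlmsg_len));
  nla->nla_type = type;
  nla->nla_len = NLA_HDRLEN + len;
  memcpy ((char *) nla + NLA_HDRLEN, data, len);
  nlh->nlmsg_len = NLMSG_ALIGN (nlh->nlmsg_len) + NLA_ALIGN (nla->nla_len);
}

/* Assemble and send one sample; family_id is the resolved "psample" id. */
static int psample_send (int fd, uint16_t family_id, uint32_t group,
                         uint32_t seq, uint16_t iifindex, uint32_t rate,
                         const void *pkt, uint32_t pkt_len)
{
  char buf[2048];
  struct nlmsghdr *nlh = (struct nlmsghdr *) buf;
  struct genlmsghdr *gh = NLMSG_DATA (nlh);

  memset (buf, 0, sizeof (buf));
  nlh->nlmsg_len = NLMSG_LENGTH (GENL_HDRLEN);
  nlh->nlmsg_type = family_id;
  nlh->nlmsg_flags = NLM_F_REQUEST;
  gh->cmd = PSAMPLE_CMD_SAMPLE;
  gh->version = PSAMPLE_GENL_VERSION;

  nl_put_attr (nlh, PSAMPLE_ATTR_IIFINDEX, &iifindex, sizeof (iifindex));
  nl_put_attr (nlh, PSAMPLE_ATTR_SAMPLE_RATE, &rate, sizeof (rate));
  nl_put_attr (nlh, PSAMPLE_ATTR_ORIGSIZE, &pkt_len, sizeof (pkt_len));
  nl_put_attr (nlh, PSAMPLE_ATTR_SAMPLE_GROUP, &group, sizeof (group));
  nl_put_attr (nlh, PSAMPLE_ATTR_GROUP_SEQ, &seq, sizeof (seq));
  nl_put_attr (nlh, PSAMPLE_ATTR_DATA, pkt, pkt_len);

  return send (fd, nlh, nlh->nlmsg_len, 0);
}
```

Even before the syscall itself, all this buffer assembly is work the dataplane arguably shouldn't
be doing per sample, which is what the measurements below bear out.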

The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
counters so we can more accurately count the cumulative time spent in each part of the calls, see
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
err` instead.

When loadtesting this, the deadly sin of impacting performance of interfaces that do not have
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
received and how many it transmitted:

```
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
    vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow                  active       844916    216298496         0  8.69e1        256.00
sflow                  active      1107466    283511296         0  8.26e1        256.00
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
   217929472  sflow  sflow packets processed     error
     1614519  sflow  sflow packets sampled       error
  2606893106  sflow  CPU cycles in sent samples  error
   280697344  sflow  sflow packets processed     error
     2078203  sflow  sflow packets sampled       error
  1844674406  sflow  CPU cycles in sent samples  error
```

At a glance, I can see in the first `grep` the in and out vector (== packet) rates for each worker
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
(as _even_ worker thread numbers are on NUMA domain 0), and worker thread 1 is servicing
TenGig130/0/0. What's cool about this is that it gives me an easy way to compare baseline L3
(10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and
9.45Mpps in experiment.

Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
are 1614519 sampled packets out of 217929472 processed packets (ie. roughly a 1:100 rate)? I added a
CPU clock cycle counter that accumulates the clocks spent once samples are taken. I can see that
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
which is pretty terrible.

**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
removes the regression on non-enabled interfaces, and we learned what the cost is of sending those
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
the queue/ring in another thread that won't block the dataplane.

### v3: SVM FIFO from workers, dedicated PSAMPLE pthread

**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption of the FIFO queue messages_

Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which each worker can write its samples in the dataplane node. A new
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` on all workers' queues and
pump those samples into Netlink.
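
The shape of that data path, as a small sketch (illustrative; the commit's actual `fifo_segment`
setup is more involved, `sflow_sample_t` is the record type from earlier, and `psample_write()`
stands in for the netlink writer):

```
#include <svm/svm_fifo.h>

/* Worker side: copy one fixed-size sample record into this worker's FIFO;
 * if there's no room, drop the sample rather than block the dataplane. */
static void
worker_enqueue_sample (svm_fifo_t *f, sflow_sample_t *s)
{
  if (svm_fifo_max_enqueue (f) >= sizeof (*s))
    svm_fifo_enqueue (f, sizeof (*s), (u8 *) s);
  /* else: sample is dropped */
}

/* Consumer side, in the spt_process_samples pthread: drain each worker's
 * FIFO and write the samples out to PSAMPLE netlink. */
static void
drain_worker_fifo (svm_fifo_t *f)
{
  sflow_sample_t s;
  while (svm_fifo_dequeue (f, sizeof (s), (u8 *) &s) == sizeof (s))
    psample_write (&s);   /* hypothetical netlink writer */
}
```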

The overhead of copying the samples onto a VPP-native `svm_fifo` seems to be two orders of magnitude
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
wrong, as invariably after a short while the Netlink writes start returning _Message too long_
errors.

```
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow                  active      1096132    280609792         0  1.63e1        256.00
sflow                  active      1584577    405651712         0  1.46e1        256.00
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
   280635904  sflow  sflow packets processed     error
     2079194  sflow  sflow packets sampled       error
   733447310  sflow  CPU cycles in sent samples  error
   405689856  sflow  sflow packets processed     error
     3004118  sflow  sflow packets sampled       error
  1844674407  sflow  CPU cycles in sent samples  error
```

Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
~87 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
(733447310 cycles over 2079194 samples) in this version. Also, any risk of Netlink writes stalling
the dataplane has been eliminated, because they're now offloaded to a different thread entirely.

**Debrief**: It's not great that we created a new Linux `pthread` for the consumer of the samples.
VPP has an elaborate thread management system, with collaborative multitasking in its threading
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
on. I can't help but wonder: wouldn't we be able to just move the `spt_process_samples()` thread
into a VPP process node instead?

### v3bis: SVM FIFO, PSAMPLE process in Main

**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_

Neil agrees that there's no good reason to keep this out of main, and conjures up
[[df2dab8d](https://github.com/sflow/vpp-sflow/commit/df2dab8d)], which rewrites the thread as an
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
_process_ wakes up and does some work. It's a widely used pattern in VPP.
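
For those unfamiliar with the pattern: a VPP _process_ node is a cooperatively scheduled routine
that runs in the main thread and yields between wakeups. A minimal sketch of its shape (names are
illustrative, body elided):

```
#include <vlib/vlib.h>

static uword
sflow_process_samples (vlib_main_t *vm, vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{
  uword *event_data = 0;

  while (1)
    {
      /* Yield; wake up after 100ms, or earlier if an event is signalled. */
      vlib_process_wait_for_event_or_clock (vm, 0.1);
      vlib_process_get_events (vm, &event_data);
      vec_reset_length (event_data);

      /* ... drain the per-worker FIFOs, batch samples out to PSAMPLE ... */
    }
  return 0;
}

VLIB_REGISTER_NODE (sflow_process_samples_node) = {
  .function = sflow_process_samples,
  .type = VLIB_NODE_TYPE_PROCESS,
  .name = "sflow-process-samples",
};
```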

Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
rate, which causes lots of `clib_warning()` messages to be spewed on the console. I replace those
with a counter of failed Netlink messages instead, and commit the refactor in
[[6ba4715](https://github.com/sflow/vpp-sflow/commit/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].

```
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28052  4.66e4          0.00
sflow                  active      1134102    290330112         0  1.42e1        256.00
sflow                  active      1647240    421693440         0  1.32e1        256.00
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
       77945  sflow  sflow PSAMPLE sent          error
         863  sflow  sflow PSAMPLE send failed   error
   290376960  sflow  sflow packets processed     error
     2151184  sflow  sflow packets sampled       error
   421761024  sflow  sflow packets processed     error
     3119625  sflow  sflow packets sampled       error
```

With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ uses
4.66e4 CPU cycles per wakeup. Secondly, replacing `clib_warning()` with the `sflow PSAMPLE send
failed` counter reduced the time spent in the dataplane from 16.3 to 14.2 cycles on average. Nice.

**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
these send failures and corrupt packets are really messing things up. So while the provided FIFO
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
and we fear that it may not be safe to read from another thread.

### v4: Custom lockless FIFO, PSAMPLE process in Main

**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_

After reading around a bit in DPDK's
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
commit in
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
`sflow_fifo_enqueue()`, to be called in the workers, and `sflow_fifo_dequeue()`, to be called in the
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
consider black magic, in commit
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
which makes use of the `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
`vppinfra/atomics.h`.
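
To give a feel for the technique, here's my own minimal sketch of the single-producer,
single-consumer case (one worker, one reader in main) under the same acquire/release idea - not
Neil's actual code. The producer only ever writes `tail`, the consumer only ever writes `head`, and
a release store publishes a slot only after its contents are written; VPP's
`clib_atomic_load_acq_n()`/`clib_atomic_store_rel_n()` wrap the `__atomic` builtins used below:

```
#include <stdint.h>

#define FIFO_DEPTH 4 /* power of two; deliberately shallow, see below */

typedef struct { uint8_t data[256]; uint32_t len; } sample_t;

typedef struct {
  uint32_t head; /* only written by the consumer (main) */
  uint32_t tail; /* only written by the producer (worker) */
  sample_t slots[FIFO_DEPTH];
} sample_fifo_t;

/* Worker: returns 0 and drops the sample if the FIFO is full; never blocks. */
static inline int fifo_enqueue (sample_fifo_t *f, const sample_t *s)
{
  uint32_t tail = f->tail;
  uint32_t head = __atomic_load_n (&f->head, __ATOMIC_ACQUIRE);
  if (tail - head >= FIFO_DEPTH)
    return 0;                         /* full */
  f->slots[tail % FIFO_DEPTH] = *s;   /* write the slot first... */
  __atomic_store_n (&f->tail, tail + 1, __ATOMIC_RELEASE); /* ...then publish */
  return 1;
}

/* Main: returns 0 if the FIFO is empty. */
static inline int fifo_dequeue (sample_fifo_t *f, sample_t *s)
{
  uint32_t head = f->head;
  uint32_t tail = __atomic_load_n (&f->tail, __ATOMIC_ACQUIRE);
  if (head == tail)
    return 0;                         /* empty */
  *s = f->slots[head % FIFO_DEPTH];
  __atomic_store_n (&f->head, head + 1, __ATOMIC_RELEASE);
  return 1;
}
```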

What I really like about this change is that it introduces a FIFO implementation in about twenty
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
and will be even faster than it was before. I take it out for a loadtest:

```
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     17767  1.52e6          0.00
sflow                  active      1121156    287015936         0  1.56e1        256.00
sflow                  active      1605772    411077632         0  1.53e1        256.00
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
     3553600  sflow  sflow PSAMPLE sent          error
   287101184  sflow  sflow packets processed     error
     2127024  sflow  sflow packets sampled       error
      350224  sflow  sflow packets dropped       error
   411199744  sflow  sflow packets processed     error
     3043693  sflow  sflow packets sampled       error
     1266893  sflow  sflow packets dropped       error
```

This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone; there is a slight regression (because we're now actually _sending_ the
messages). With the v3bis variant, only a tiny fraction of the samples made it through to Netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken on
the first interface, 350224 are prematurely dropped (because the FIFO queue is full). This is a
perfect defense in depth!

Doing the math, each worker enqueued exactly 1776800 samples in 30 seconds (2127024 - 350224 for the
first, 3043693 - 1266893 for the second), which is about 59k/s per interface. I can also see that
the second interface, which is doing L2XC and hits a much higher packets/sec throughput, drops more
samples, because main spends an equal amount of time reading samples from each queue. In other
words: in an overload scenario, one interface cannot crowd out another. Slick.

Finally, completing my math, each worker enqueued 1776800 samples to its FIFO, and I see that main
dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the
`sflow PSAMPLE send failed` counter remains zero.

{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}

**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
these netlink messages. There's also a bit of work to do to retrieve the packet and byte counters
of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some messages
from VPP. He calls this `mod_vpp`, and he mails a screenshot of his work in progress. I'll discuss
the end-to-end changes to `hsflowd` in a followup article, and focus my efforts here on documenting
the VPP parts only. But, as a teaser, here's a screenshot of validated `sflowtool` output from a
VPP instance using our `sFlow` plugin and his pending `host-sflow` changes to integrate the rest of
the business logic outside of the VPP dataplane, where it's arguably expensive to make mistakes.

Neil reveals an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Most of the
time, the plugin is just shoveling ethernet packets through to `ethernet-input`. To make better use
of the CPU's instruction cache and prefetching, the loop that does this shoveling can handle one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually reduce the CPU cycles spent per packet if you shovel
four of them at a time.

### v5: Quad Bucket Brigade in worker

**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_

Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
2-packet to 4-packet shoveling. In commit
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
extends a common pattern in VPP dataplane nodes: each time the node iterates, it prefetches up to
eight packets ahead (`p0`-`p7`) if the vector is long enough, and handles packets four at a time
(`b0`-`b3`). He also adds a few compiler hints for branch prediction: almost no packets will have
tracing enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize
the code.
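
The shape of such a loop, as a simplified sketch (the plugin's `node.c` also handles tracing and
next-node indices, and `process_one()` is a hypothetical per-packet handler I made up):

```
#include <vlib/vlib.h>

/* Simplified "quad bucket brigade" in the VPP style: while handling buffers
 * b[0]..b[3], prefetch the headers of the *next* four so they're warm in
 * cache by the time the loop comes around. */
static_always_inline void
quad_bucket_brigade (vlib_main_t *vm, vlib_node_runtime_t *node,
                     vlib_buffer_t **b, u32 n_left_from)
{
  while (n_left_from >= 4)
    {
      if (PREDICT_TRUE (n_left_from >= 8))
        {
          vlib_prefetch_buffer_header (b[4], LOAD);
          vlib_prefetch_buffer_header (b[5], LOAD);
          vlib_prefetch_buffer_header (b[6], LOAD);
          vlib_prefetch_buffer_header (b[7], LOAD);
        }
      process_one (vm, node, b[0]);
      process_one (vm, node, b[1]);
      process_one (vm, node, b[2]);
      process_one (vm, node, b[3]);
      b += 4;
      n_left_from -= 4;
    }
  while (n_left_from > 0)   /* handle the remainder one at a time */
    {
      process_one (vm, node, b[0]);
      b += 1;
      n_left_from -= 1;
    }
}
```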

Reading the dataplane code, I find it incredibly ugly. But that's the price to pay for ultra fast
throughput. How do we see the effect, though? My low-tech proposal is to enable sampling at a very
sparse rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the
FIFO is almost never called. Then, what's left for the `sFlow` dataplane node really is to shovel
packets from `device-input` into `ethernet-input`.

To measure the relative improvement, I do one test with, and one without, commit
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].

```
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
  vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28467  9.36e3          0.00
sflow                  active      1158325    296531200         0  1.09e1        256.00
sflow                  active      1679742    430013952         0  1.11e1        256.00

pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
  vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28462  9.57e3          0.00
sflow                  active      1137571    291218176         0  1.26e1        256.00
sflow                  active      1641991    420349696         0  1.20e1        256.00
```

Would you look at that, this optimization actually works as advertised! There is a meaningful
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
L2XC). So at the expense of adding 63 lines of code, there is a 1.8% (L3) to 2.3% (L2XC) increase
in throughput. **Quad-Bucket-Brigade, yaay!**

I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
recall that at the beginning of this article I ran an acceptance loadtest with sFlow disabled; this
is the exact same result **with sFlow** enabled:

{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}

This picture says it all: 79.98Gbps in, 79.98Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
samples/sec taken from the dataplane, with correct rate limiting due to the per-worker FIFO depth
limit, yielding 25k samples/sec sent to Netlink.

## What's Next

Checking in on the three main things we wanted to ensure with the plugin:

1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average.
1. ✅ If `sFlow` samples a packet, it takes only marginally more CPU time to enqueue it. With no
   sampling we get 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces
   this to 9.77Mpps of L3 and 14.05Mpps of L2XC, and an overly harsh 1:100 reduces it to 9.69Mpps
   and 13.97Mpps.

The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet
and byte counters, send this information along with possible Linux CP data (such as the TAP
interface ID on the Linux side), and add the module for VPP in `hsflowd`. I'll write about that
part in a followup article.

Neil has introduced the plugin on vpp-dev@, and so far there were no objections. But he has pointed
folks at an out-of-tree GitHub repo, and I may add a Gerrit instead so that it becomes part of the
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, and add some
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
shape.

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from.