Some typo and readability fixes

Pim van Pelt
2024-10-06 19:11:42 +02:00
parent 3c69130cea
commit ba068c1c52


@@ -41,13 +41,13 @@ possible, targeting single digit CPU cycles per packet in overhead.
quickly as possible, targeting double digit CPU cycles per sample.

For all these validation and loadtests, I use a bare metal VPP machine which is receiving load from
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.

**1. RX Queue Placement**

It's important that the network card that is receiving the traffic gets serviced by a worker thread
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
align the NIC with the correct processor, like so:

```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
@@ -64,8 +64,9 @@ set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
**2. L3 IPv4/MPLS interfaces**

I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
call the _experiment_ pairs).

```
comment { L3: IPv4 interfaces }
@@ -109,7 +110,7 @@ mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels
**3. L2 CrossConnect interfaces**

Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both measure the
performance impact of enabling `sFlow`, and assert whether any regression occurs in the _baseline_
pair when I enable a feature in the _experiment_ pair, which should really never happen.
@@ -172,8 +173,9 @@ it'll accept any ethernet frame, not only those sent to the NIC's own MAC address
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.

With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
IPv4, and L2XC. Neat!

{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
@@ -204,7 +206,7 @@ Here's the bloodbath as seen from T-Rex:
{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}

**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be
much faster if, when a worker thread produces a sample, it appended that sample to a producer queue
and moved on, instead of sending an RPC to main and taking the spinlock. This way, no locks are
needed, and each worker thread will have its own producer queue.
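
To make the per-worker producer queue idea concrete, here is a minimal single-producer/single-consumer
ring in plain C. This is only an illustrative sketch, not the actual vpp-sflow code: the type names,
the sample layout, and the ring size are assumptions invented for this example.

```c
/* Hedged sketch of a per-worker, lock-free SPSC sample ring. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SFLOW_RING_SIZE 64u            /* hypothetical; must be a power of two */

typedef struct {
  uint32_t if_index;                   /* sampled interface */
  uint32_t pkt_len;                    /* original packet length */
  uint8_t  hdr[128];                   /* truncated packet header */
} sflow_sample_t;

typedef struct {
  sflow_sample_t   slots[SFLOW_RING_SIZE];
  _Atomic uint32_t head;               /* written only by the worker (producer) */
  _Atomic uint32_t tail;               /* written only by main (consumer) */
} sflow_worker_ring_t;

/* Called from the worker's dataplane node: enqueue and move on, never block. */
static bool
sflow_ring_enqueue (sflow_worker_ring_t *r, const sflow_sample_t *s)
{
  uint32_t head = atomic_load_explicit (&r->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit (&r->tail, memory_order_acquire);

  if (head - tail == SFLOW_RING_SIZE)
    return false;                      /* full: drop the sample, don't stall */

  r->slots[head % SFLOW_RING_SIZE] = *s;
  atomic_store_explicit (&r->head, head + 1, memory_order_release);
  return true;
}
```

Because only the worker writes `head` and only main writes `tail`, the sample path needs no spinlock
and no RPC to main.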
@@ -286,7 +288,7 @@ the queue/ring in another thread that won't block the dataplane.
**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_

Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
@@ -366,7 +368,7 @@ pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
4660 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
counter reduced the average time spent in the dataplane from 16.3 to 14.2 cycles. Nice.
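
The cycle win from that swap is easy to picture: bumping a counter is a single memory write, while a
`clib_warning()`-style call has to format a string and write it out on the hot path. A rough sketch
of the contrast, in plain C rather than VPP's actual logging and error-counter machinery (the names
below are made up for illustration):

```c
/* Hedged illustration: counting failures vs. logging them in a hot path. */
#include <stdint.h>
#include <stdio.h>

enum { SFLOW_ERR_PSAMPLE_SEND_FAILED, SFLOW_N_ERRORS };

static uint64_t sflow_err_counters[SFLOW_N_ERRORS]; /* hypothetical layout */

static inline void
count_psample_send_failure (void)
{
  /* One increment: no string formatting, no I/O. */
  sflow_err_counters[SFLOW_ERR_PSAMPLE_SEND_FAILED]++;
}

static inline void
log_psample_send_failure (int rc)
{
  /* The logging path formats a message and writes it out, which costs far
     more cycles and can stall the dataplane when failures come in bursts. */
  fprintf (stderr, "sflow: PSAMPLE send failed (rc=%d)\n", rc);
}
```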

**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
@@ -416,23 +418,23 @@ pim@hvn6-lab:~$ grep sflow v4-100-err.txt
```

This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone, and there is a slight regression (because we're now actually _sending_ the
messages). With the v3bis variant, only a tiny fraction of the samples made it through to Netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!

Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
packets/sec throughput, is dropping more samples, because main spends an equal amount of time
reading samples from each worker's queue. In other words: in an overload scenario, one interface
cannot crowd out another. Slick.
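
Continuing the earlier ring sketch, the fairness described here could look roughly like the
following: main drains at most a fixed quota per worker FIFO on each pass, so one busy interface
cannot starve the rest. Again, this is a hedged illustration rather than the plugin's actual drain
loop, and `sflow_send_psample()` is a made-up stand-in for the PSAMPLE Netlink send.

```c
/* Consumer side of the earlier SPSC sketch: drain each worker ring fairly. */
static bool
sflow_ring_dequeue (sflow_worker_ring_t *r, sflow_sample_t *out)
{
  uint32_t tail = atomic_load_explicit (&r->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit (&r->head, memory_order_acquire);

  if (tail == head)
    return false;                      /* empty */

  *out = r->slots[tail % SFLOW_RING_SIZE];
  atomic_store_explicit (&r->tail, tail + 1, memory_order_release);
  return true;
}

/* Hypothetical stand-in for the PSAMPLE Netlink send. */
static void
sflow_send_psample (const sflow_sample_t *s)
{
  (void) s;
}

/* Take at most 'quota' samples per worker per pass, so a busy L2XC
   interface cannot crowd out the others while main writes to Netlink. */
static void
sflow_drain_all (sflow_worker_ring_t *rings, int n_workers, int quota)
{
  sflow_sample_t s;
  for (int w = 0; w < n_workers; w++)
    for (int i = 0; i < quota && sflow_ring_dequeue (&rings[w], &s); i++)
      sflow_send_psample (&s);
}
```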

Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
the `sflow PSAMPLE send failed` counter remains zero.

{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
@@ -446,10 +448,10 @@ on documenting the VPP parts only. But, as a teaser, here's a screenshot of a v
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
expensive to make mistakes.

Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most
of the time the plugin is really just shoveling the ethernet packets through to `ethernet-input`. To
make use of some CPU instruction cache affinity, the loop that does this shoveling can do it one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
packet, if you shovel four of them at a time.
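
Stripped of all VPP specifics, the shape of that quad loop is roughly the following. This is a
generic C sketch of the pattern, not the code in `plugins/sflow/node.c`; `process_one()` is a
placeholder for the per-packet work.

```c
/* Generic quad-loop sketch (illustrative only): handle four packets per
   iteration, then mop up the remainder one at a time. */
#include <stddef.h>
#include <stdint.h>

static inline void
process_one (uint8_t *pkt)
{
  (void) pkt;                          /* placeholder for per-packet work */
}

static void
process_frame (uint8_t **pkts, size_t n)
{
  size_t i = 0;

  /* Four at a time: better instruction-cache reuse and more room for the
     compiler and CPU to overlap memory accesses. */
  for (; i + 4 <= n; i += 4)
    {
      process_one (pkts[i + 0]);
      process_one (pkts[i + 1]);
      process_one (pkts[i + 2]);
      process_one (pkts[i + 3]);
    }

  /* Leftovers, one at a time. */
  for (; i < n; i++)
    process_one (pkts[i]);
}
```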
@@ -519,16 +521,17 @@ limit, yielding 25k samples/sec sent to Netlink.
Checking in on the three main things we wanted to ensure with the plugin:

1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average.
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
   * No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
   * 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
   * and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
packet and byte counters, send this information along with possible Linux CP data (such as the
TAP interface ID on the Linux side), and add the module for VPP in `hsflowd`. I'll write about
that part in a followup article.

Neil has introduced this plugin to vpp-dev@, and so far there have been no objections. But he has
pointed folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the