Some typo and readability fixes
@@ -41,13 +41,13 @@ possible, targeting single digit CPU cycles per packet in overhead.
 quickly as possible, targeting double digit CPU cycles per sample.

 For all these validation and loadtests, I use a bare metal VPP machine which is receiving load from
-a T-Rex loadtester on eight TenGig ports, which I have configured as follows:
+a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.

 **1. RX Queue Placement**

 It's important that the network card that is receiving the traffic, gets serviced by a worker thread
-on the same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I will align the
-NIC with the correct CPU, like so:
+on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
+align the NIC with the correct processor, like so:

 ```
 set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
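For readers reproducing this setup, the NUMA node of a NIC can be read from the standard Linux sysfs attribute before deciding on `rx-placement`. A minimal sketch, using a hypothetical PCI address rather than the ones from the article:

```c
/* Illustrative only: print the NUMA node of a PCI NIC from sysfs, so that
 * "set interface rx-placement" can pin its RX queue to a worker on the same
 * socket. The PCI address below is a placeholder, not from the article. */
#include <stdio.h>

int main(void) {
    const char *pci = "0000:03:00.0";            /* hypothetical PCI address */
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", pci);

    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    int node = -1;
    if (fscanf(f, "%d", &node) != 1) node = -1;  /* -1 means no NUMA affinity */
    fclose(f);

    printf("%s is on NUMA node %d\n", pci, node);
    return 0;
}
```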
@@ -64,8 +64,9 @@ set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
 **2. L3 IPv4/MPLS interfaces**

 I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
-comparison with L3 IPv4 or MPLS running without `sFlow` (which I'll call the _baseline_ pair) and
-one which is running _with_ `sFlow` (which I'll call the _experiment_ pair).
+comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
+the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
+call the _experiment_ pairs).

 ```
 comment { L3: IPv4 interfaces }
@@ -109,7 +110,7 @@ mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels

 **3. L2 CrossConnect interfaces**

-Here, I will use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
+Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
 interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both make a comparison
 on the performance impact of enabling `sFlow`, but I can also assert if any regression occurs in the
 _baseline_ pair if I enable a feature in the _experiment_ pair, which should really never happen.
@@ -172,8 +173,9 @@ it'll accept any ethernet frame, not only those sent to the NIC's own MAC address
 L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
 connections and looking up FDB entries on the Mellanox switch much, much easier this way.

-With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate,
-which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!
+With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
+packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
+IPv4, and L2XC. Neat!

 {{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}

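As a quick sanity check on those numbers: a 256-byte frame occupies roughly 276 bytes of wire time once preamble and inter-frame gap are counted, so eight 10GbE ports at line rate work out to about 36Mpps. A small back-of-envelope calculation, not part of the commit:

```c
/* Back-of-envelope check: a 256B frame plus ~20B of L1 overhead (preamble and
 * inter-frame gap) is 276B of wire time per packet. Eight 10GbE ports at line
 * rate should therefore carry roughly 36Mpps. */
#include <stdio.h>

int main(void) {
    const double line_bps   = 10e9;  /* one TenGig port                  */
    const double frame_b    = 256;   /* frame size used in the loadtest  */
    const double l1_extra_b = 20;    /* preamble + inter-frame gap       */
    const int    ports      = 8;

    double pps_per_port = line_bps / ((frame_b + l1_extra_b) * 8);
    printf("per port : %.2f Mpps\n", pps_per_port / 1e6);          /* ~4.53 */
    printf("8 ports  : %.2f Mpps\n", ports * pps_per_port / 1e6);  /* ~36.2 */

    /* The observed 36.15Mpps also lines up, if the Gbps figure counts L1 bytes: */
    printf("observed : %.2f Gbps\n", 36.15e6 * (frame_b + l1_extra_b) * 8 / 1e9); /* ~79.8 */
    return 0;
}
```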
@@ -204,7 +206,7 @@ Here's the bloodbath as seen from T-Rex:

 {{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}

-**Debrief**: When talked throught these issues, we sort of drew the conclusion that it would be much
+**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
 faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
 spinlock, that the worker appends the sample to a producer queue and moves on. This way, no locks
 are needed, and each worker thread will have its own producer queue.
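The per-worker producer queue idea in this debrief is essentially a single-producer/single-consumer ring. Below is a generic, lock-free illustration of that shape in C11, a stand-in for, not a copy of, what the plugin later builds on VPP's `svm` FIFOs:

```c
/* Sketch of the debrief's idea (not the plugin's actual code): each worker
 * owns a single-producer/single-consumer ring, so enqueueing a sample needs
 * no lock; a consumer in another thread drains it at its leisure. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256               /* power of two, one ring per worker */

typedef struct { uint32_t if_index; uint16_t length; uint8_t hdr[128]; } sample_t;

typedef struct {
    sample_t slot[RING_SIZE];
    _Atomic uint32_t head;          /* written by the worker (producer)  */
    _Atomic uint32_t tail;          /* written by the consumer (drainer) */
} sample_ring_t;

/* Called from the dataplane worker: O(1), lock-free, drops when full. */
bool ring_enqueue(sample_ring_t *r, const sample_t *s) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;               /* full: drop the sample, keep forwarding */
    r->slot[head % RING_SIZE] = *s;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Called from the thread that talks to Netlink/PSAMPLE. */
bool ring_dequeue(sample_ring_t *r, sample_t *out) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return false;               /* empty */
    *out = r->slot[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Dropping on full, rather than blocking, is the same property that later shows up as the "defense in depth" behaviour when the FIFO depth is deliberately kept small.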
@@ -286,7 +288,7 @@ the queue/ring in another thread that won't block the dataplane.

 **TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_

-Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
+Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
 that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
 elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
 called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
@@ -366,7 +368,7 @@ pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt

 With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
 and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
-4660 CPU cycles. Secondly, the replacement of `clib_warnign()` with the 'sflow PSAMPLE send failed`
+4660 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
 counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.

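The win here is the classic one of counting failures instead of logging them in the hot path. A minimal sketch of that pattern, with hypothetical names rather than the plugin's actual symbols:

```c
/* Sketch of the change described above: on the failure path, bump a counter
 * instead of formatting a warning per packet; read the counter out later,
 * outside the hot path. */
#include <stdint.h>
#include <stdio.h>

static uint64_t psample_send_failed;   /* plays the role of `sflow PSAMPLE send failed` */

static inline void on_send_error(void) {
    psample_send_failed++;             /* O(1): no string formatting, no I/O */
}

void show_counters(void) {             /* e.g. wired up to a CLI command */
    printf("sflow PSAMPLE send failed: %llu\n",
           (unsigned long long)psample_send_failed);
}
```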
 **Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
@@ -416,23 +418,23 @@ pim@hvn6-lab:~$ grep sflow v4-100-err.txt
 ```


-But this is starting to be a very nice implementation! With this iteration of the plugin, all the
+This is starting to be a very nice implementation! With this iteration of the plugin, all the
 corruption is gone, there is a slight regression (because we're now actually _sending_ the
-messages). With the v3bis variant, only a tiny fraction of the sampels made it through to netlink.
+messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
 With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
 FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
 to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!

-Doign the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
-interface. I can also see that the second interfae, which is doing L2XC and hits a much larger
-packets/sec throughput, is dropping more samples because it has an equal amount of time from main
+Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
+interface. I can also see that the second interface, which is doing L2XC and hits a much larger
+packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
 reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
 out another. Slick.

 Finally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that
-main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the
-`sflow PSAMPLE send failed` counter remains zero.
+main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
+the `sflow PSAMPLE send failed` counter remains zero.

 {{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}

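The bookkeeping above can be checked directly from the quoted counters, using the 30-second test duration stated in the text:

```c
/* Checking the sample accounting from the paragraphs above. */
#include <stdio.h>

int main(void) {
    const long taken   = 2127024;  /* samples taken on the first interface     */
    const long dropped = 350224;   /* dropped because the 4-deep FIFO was full */
    const long seconds = 30;       /* duration of the loadtest                 */

    long enqueued = taken - dropped;                          /* 1776800 per worker */
    printf("enqueued per worker : %ld\n", enqueued);
    printf("rate per interface  : %.1fk/s\n", enqueued / (double)seconds / 1e3); /* ~59.2 */
    printf("dequeued by main    : %ld\n", 2 * enqueued);      /* 3553600, all sent  */
    return 0;
}
```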
@@ -446,10 +448,10 @@ on documenting the VPP parts only. But, as a teaser, here's a screenshot of a v
 to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
 expensive to make mistakes.

-Neil reveals an itch that he has been meaning to scratch all this time. In VPP's
+Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
 `plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, really
 most of the time the plugin is just shoveling the ethernet packets through to `ethernet-input`. To
-make use of some CPU instruction cache affinity, the loop that does this shoveling can do it one
+make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
 packet at a time, two packets at a time, or even four packets at a time. Although the code is super
 repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
 packet, if you shovel four of them at a time.
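The "four packets at a time" loop described here is the familiar quad-loop pattern. A schematic version, with a placeholder `handle_one()` standing in for the plugin's real per-packet work in `plugins/sflow/node.c`:

```c
/* Schematic of the quad-loop pattern described above (not the actual node
 * code): handle four packets per iteration, then two, then one, so the
 * instruction stream stays hot and prefetches can overlap useful work. */
#include <stddef.h>

typedef struct packet packet_t;                   /* opaque stand-in for vlib_buffer_t */
static void handle_one(packet_t *p) { (void)p; }  /* stand-in for the per-packet work  */
static inline void prefetch(const void *p) { __builtin_prefetch(p); }

void shovel(packet_t **pkts, size_t n) {
    size_t i = 0;
    while (n - i >= 4) {                          /* quad loop */
        if (n - i >= 8) {                         /* prefetch the next four */
            prefetch(pkts[i + 4]); prefetch(pkts[i + 5]);
            prefetch(pkts[i + 6]); prefetch(pkts[i + 7]);
        }
        handle_one(pkts[i + 0]); handle_one(pkts[i + 1]);
        handle_one(pkts[i + 2]); handle_one(pkts[i + 3]);
        i += 4;
    }
    while (n - i >= 2) {                          /* dual loop */
        handle_one(pkts[i + 0]); handle_one(pkts[i + 1]);
        i += 2;
    }
    while (i < n)                                 /* single loop mops up the tail */
        handle_one(pkts[i++]);
}
```

The dual and single loops only mop up the tail of each vector, which is exactly why the real code ends up so repetitive.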
@@ -519,16 +521,17 @@ limit, yielding 25k samples/sec sent to Netlink.

 Checking in on the three main things we wanted to ensure with the plugin:

-1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
+1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
 1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
-1. ✅ If `sFlow` samples, it takes only marginally more CPU time to enqueue. No sampling gets
-9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces to 9.77Mpps of L3
-and 14.05Mpps of L2XC throughput. An overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
+1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
+   * No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
+   * 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
+   * and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

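Putting rough numbers on "marginally more CPU time", using only the throughput figures from the checklist above:

```c
/* Relative throughput cost of sampling, computed from the checklist figures. */
#include <stdio.h>

static void cost(const char *what, double base_mpps, double mpps) {
    printf("%-22s %.2f -> %.2f Mpps (%.1f%% slower)\n",
           what, base_mpps, mpps, 100.0 * (1.0 - mpps / base_mpps));
}

int main(void) {
    cost("IPv4, 1:1000 sampling", 9.88, 9.77);    /* ~1.1% */
    cost("IPv4, 1:100 sampling",  9.88, 9.69);    /* ~1.9% */
    cost("L2XC, 1:1000 sampling", 14.30, 14.05);  /* ~1.7% */
    cost("L2XC, 1:100 sampling",  14.30, 13.97);  /* ~2.3% */
    return 0;
}
```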
-The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet
-and byte counters, and send this information along with possible Linux CP data (such as the TAP
-interface ID in the Linux side), and to add the module for VPP in `hsflowd`. I'll write about that
-part in a followup article.
+The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
+packet and byte counters, and send this information along with possible Linux CP data (such as the
+TAP interface ID in the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
+that part in a followup article.

 Neil has introduced vpp-dev@ to this plugin, and so far there were no objections. But he has pointed
 folks to a github out of tree repo, and I may add a Gerrit instead so it becomes part of the