Some typo and readability fixes
@@ -41,13 +41,13 @@ possible, targeting single digit CPU cycles per packet in overhead.

quickly as possible, targeting double digit CPU cycles per sample.

For all these validation and loadtests, I use a bare metal VPP machine which is receiving load from
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.

**1. RX Queue Placement**

It's important that the network card that is receiving the traffic gets serviced by a worker thread
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
align the NIC with the correct processor, like so:

```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
```
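As an aside, a quick way to double-check which NUMA node a NIC lives on is to read its `numa_node`
attribute in sysfs. A minimal sketch of that check follows; the PCI address `0000:03:00.0` is only an
example and has to be replaced with the address of the NIC in question (a plain `cat` of the same
file does the job just as well):

```c
#include <stdio.h>

int main(void) {
  /* sysfs exposes the NUMA node of every PCI device; -1 means no/unknown node */
  const char *path = "/sys/bus/pci/devices/0000:03:00.0/numa_node";
  FILE *f = fopen(path, "r");
  if (!f) {
    perror(path);
    return 1;
  }
  int node = -1;
  if (fscanf(f, "%d", &node) == 1)
    printf("%s -> NUMA node %d\n", path, node);
  fclose(f);
  return 0;
}
```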
@@ -64,8 +64,9 @@ set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7

**2. L3 IPv4/MPLS interfaces**

I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
call the _experiment_ pairs).

```
comment { L3: IPv4 interfaces }
```
@@ -109,7 +110,7 @@ mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels

**3. L2 CrossConnect interfaces**

Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both measure the
performance impact of enabling `sFlow`, and check whether any regression occurs in the _baseline_
pair when I enable a feature in the _experiment_ pair, which should really never happen.
@@ -172,8 +173,9 @@ it'll accept any ethernet frame, not only those sent to the NIC's own MAC addres

L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.

With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
IPv4, and L2XC. Neat!

{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
@@ -204,7 +206,7 @@ Here's the bloodbath as seen from T-Rex:

{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}

**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks
are needed, and each worker thread will have its own producer queue.

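To make that concrete, here is a minimal, self-contained sketch of such a per-worker
single-producer/single-consumer ring. This is only an illustration of the approach, not the plugin's
actual code: the `sample_t` layout, the ring size and the function names are all made up for the
example. Because exactly one worker writes and only main reads, each side only ever updates its own
index, so no locks are needed:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256 /* power of two, so we can mask instead of modulo */

typedef struct {
  uint32_t if_index;    /* interface the sample was taken on */
  uint16_t sampled_len; /* how many header bytes were copied */
  uint8_t hdr[128];     /* truncated packet header */
} sample_t;

typedef struct {
  sample_t slot[RING_SIZE];
  _Atomic uint32_t head; /* only the worker (producer) writes this */
  _Atomic uint32_t tail; /* only main (consumer) writes this */
} sample_ring_t;

/* Dataplane side: never blocks, never spins; drops the sample when full. */
static bool ring_try_enqueue(sample_ring_t *r, const sample_t *s) {
  uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
  if (head - tail == RING_SIZE)
    return false; /* full: caller bumps a drop counter and moves on */
  r->slot[head & (RING_SIZE - 1)] = *s;
  atomic_store_explicit(&r->head, head + 1, memory_order_release);
  return true;
}

/* Main side: pulls one sample off the ring, if there is one. */
static bool ring_try_dequeue(sample_ring_t *r, sample_t *out) {
  uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
  if (head == tail)
    return false; /* empty */
  *out = r->slot[tail & (RING_SIZE - 1)];
  atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
  return true;
}
```

The property that matters for the dataplane is that `ring_try_enqueue()` is wait-free: if main can't
keep up, the worker drops the sample and keeps forwarding packets.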
@@ -286,7 +288,7 @@ the queue/ring in another thread that won't block the dataplane.

**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_

Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
@@ -366,7 +368,7 @@ pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt

With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
4660 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.

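That cycle win is unsurprising: bumping a counter is a single add on an already-warm cache line,
while formatting and emitting a log message on every failed send costs far more and can even block.
A tiny generic illustration of the difference follows; it is not VPP code, and the real plugin uses
VPP's per-node error counters rather than a bare global like this:

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t psample_send_failed; /* bumped in the hot path, read out-of-band */

/* Cheap: one increment, no formatting, no syscall. */
static void note_send_failed(void) {
  psample_send_failed++;
}

/* Expensive: string formatting plus a write() for every single failure. */
static void log_send_failed(const char *reason) {
  fprintf(stderr, "sflow: PSAMPLE send failed: %s\n", reason);
}

int main(void) {
  note_send_failed();
  log_send_failed("netlink buffer full");
  printf("failures so far: %llu\n", (unsigned long long)psample_send_failed);
  return 0;
}
```
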
**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
@@ -416,23 +418,23 @@ pim@hvn6-lab:~$ grep sflow v4-100-err.txt

This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone, and there is a slight regression (because we're now actually _sending_ the
messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!

Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
packets/sec throughput, is dropping more samples, because main spends an equal amount of time
reading samples from each queue. In other words: in an overload scenario, one interface cannot crowd
out another. Slick.

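One way to picture that fairness property, reusing the toy ring from the sketch earlier: main visits
every worker FIFO in turn and takes at most a fixed budget from each per round, so a busy interface
only ever competes with itself. Again, this is an illustration of the principle and not the plugin's
actual process node; `N_WORKERS`, the budget and `send_to_netlink()` are placeholders:

```c
#define N_WORKERS 2
#define BUDGET_PER_ROUND 32

extern sample_ring_t worker_ring[N_WORKERS]; /* one ring per worker thread */
extern void send_to_netlink(const sample_t *s);

void drain_all_rings(void) {
  sample_t s;
  for (int w = 0; w < N_WORKERS; w++) {
    /* Every ring gets the same budget per round, no matter how full it is.
     * Anything beyond the budget waits for the next round, or was already
     * dropped at enqueue time when the ring filled up. */
    for (int taken = 0; taken < BUDGET_PER_ROUND; taken++) {
      if (!ring_try_dequeue(&worker_ring[w], &s))
        break;
      send_to_netlink(&s);
    }
  }
}
```
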
Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
the `sflow PSAMPLE send failed` counter remains zero.

{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
@@ -446,10 +448,10 @@ on documenting the VPP parts only. But, as a teaser, here's a screenshot of a v

to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
expensive to make mistakes.

Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most
of the time the plugin is really just shoveling the ethernet packets through to `ethernet-input`. To
make use of some CPU instruction cache affinity, the loop that does this shoveling can do it one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
packet, if you shovel four of them at a time.

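For readers who haven't seen the pattern before, this is roughly what such a quad loop boils down
to, in plain C with an array standing in for VPP's vector of buffer pointers (the real node also
prefetches upcoming buffers, which I leave out here):

```c
#include <stddef.h>
#include <stdint.h>

static inline void handle_one(uint32_t *pkt) {
  *pkt += 1; /* stand-in for the real per-packet work */
}

void handle_all(uint32_t *pkts, size_t n) {
  size_t i = 0;
  /* Four packets per iteration: the loop body stays hot in the instruction
   * cache and gives the CPU independent work it can overlap. */
  for (; i + 4 <= n; i += 4) {
    handle_one(&pkts[i + 0]);
    handle_one(&pkts[i + 1]);
    handle_one(&pkts[i + 2]);
    handle_one(&pkts[i + 3]);
  }
  /* Tail: whatever is left over, one at a time. */
  for (; i < n; i++)
    handle_one(&pkts[i]);
}
```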
@@ -519,16 +521,17 @@ limit, yielding 25k samples/sec sent to Netlink.

Checking in on the three main things we wanted to ensure with the plugin:

1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average.
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
   * No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
   * 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
   * and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
packet and byte counters, and send this information along with possible Linux CP data (such as the
TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
that part in a followup article.

Neil has introduced this plugin to vpp-dev@, and so far there have been no objections. But he has
pointed folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the