From ba068c1c52310436bd7bff4afe2916c35b48e744 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sun, 6 Oct 2024 19:11:42 +0200
Subject: [PATCH] Some typo and readability fixes

---
 content/articles/2024-10-06-sflow-2.md | 59 ++++++++++++++------------
 1 file changed, 31 insertions(+), 28 deletions(-)

diff --git a/content/articles/2024-10-06-sflow-2.md b/content/articles/2024-10-06-sflow-2.md
index 000a4c6..28a9eb0 100644
--- a/content/articles/2024-10-06-sflow-2.md
+++ b/content/articles/2024-10-06-sflow-2.md
@@ -41,13 +41,13 @@ possible, targetting single digit CPU cycles per packet in overhead.
quickly as possible, targetting double digit CPU cycles per sample.

For all these validation and loadtests, I use a bare metal VPP machine which is receiving load from
-a T-Rex loadtester on eight TenGig ports, which I have configured as follows:
+a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.

**1. RX Queue Placement**

It's important that the network card that is receiving the traffic, gets serviced by a worker thread
-on the same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I will align the
-NIC with the correct CPU, like so:
+on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
+align the NIC with the correct processor, like so:

```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
@@ -64,8 +64,9 @@ set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
```

**2. L3 IPv4/MPLS interfaces**

I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
-comparison with L3 IPv4 or MPLS running without `sFlow` (which I'll call the _baseline_ pair) and
-one which is running _with_ `sFlow` (which I'll call the _experiment_ pair).
+comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
+the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
+call the _experiment_ pairs).

```
comment { L3: IPv4 interfaces }
@@ -109,7 +110,7 @@ mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels

**3. L2 CrossConnect interfaces**

-Here, I will use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
+Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both make a comparison
on the performance impact of enabling `sFlow`, but I can also assert if any regression occurs in the
_baseline_ pair if I enable a feature in the _experiment_ pair, which should really never happen.
@@ -172,8 +173,9 @@ it'll accept any ethernet frame, not only those sent to the NIC's own MAC addres
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.

-With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate,
-which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!
+With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
+packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
+IPv4, and L2XC. Neat!
{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
@@ -204,7 +206,7 @@ Here's the bloodbath as seen from T-Rex:

{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}

-**Debrief**: When talked throught these issues, we sort of drew the conclusion that it would be much
+**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
spinlock, that the worker appends the sample to a producer queue and moves on. This way, no locks
are needed, and each worker thread will have its own producer queue.
@@ -286,7 +288,7 @@ the queue/ring in another thread that won't block the dataplane.

**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_

-Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
+Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
@@ -366,7 +368,7 @@ pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt

With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
-4660 CPU cycles. Secondly, the replacement of `clib_warnign()` with the 'sflow PSAMPLE send failed`
+4660 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.

**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
@@ -416,23 +418,23 @@ pim@hvn6-lab:~$ grep sflow v4-100-err.txt
```

-But this is starting to be a very nice implementation! With this iteration of the plugin, all the
+This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone, there is a slight regression (because we're now actually _sending_ the
-messages). With the v3bis variant, only a tiny fraction of the sampels made it through to netlink.
+messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is
trying to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples
taken, 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense
in depth!

-Doign the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
-interface. I can also see that the second interfae, which is doing L2XC and hits a much larger
-packets/sec throughput, is dropping more samples because it has an equal amount of time from main
+Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
+interface.
I can also see that the second interface, which is doing L2XC and hits a much larger
+packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
out another. Slick.

Finally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that
-main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the
-`sflow PSAMPLE send failed` counter remains zero.
+main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
+the `sflow PSAMPLE send failed` counter remains zero.

{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}

@@ -446,10 +448,10 @@ on documenting the VPP parts only. But, as a teaser, here's a screenshot of a v
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
expensive to make mistakes.

-Neil reveals an itch that he has been meaning to scratch all this time. In VPP's
+Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, really
most of the time the plugin is just shoveling the ethernet packets through to `ethernet-input`. To
-make use of some CPU instruction cache affinity, the loop that does this shoveling can do it one
+make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
packet, if you shovel four of them at a time.
@@ -519,16 +521,17 @@ limit, yielding 25k samples/sec sent to Netlink.

Checking in on the three main things we wanted to ensure with the plugin:

-1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
+1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
-1. ✅ If `sFlow` samples, it takes only marginally more CPU time to enqueue. No sampling gets
-9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces to 9.77Mpps of L3
-and 14.05Mpps of L2XC throughput. An overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
+1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
+ * No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
+ * 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
+ * and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

-The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet
-and byte counters, and send this information along with possible Linux CP data (such as the TAP
-interface ID in the Linux side), and to add the module for VPP in `hsflowd`. I'll write about that
-part in a followup article.
+The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
+packet and byte counters, and send this information along with possible Linux CP data (such as the
+TAP interface ID in the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
+that part in a followup article.
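To make the worker-to-main hand-off described in the debriefs above a bit more concrete, here is a
minimal sketch of the per-worker producer-queue idea: each worker owns one small FIFO that only it
writes and only main reads, and when the FIFO is full the worker drops the sample rather than stall.
The names (`sflow_sample_t`, `sflow_worker_fifo_t`, `worker_enqueue_sample`, `main_dequeue_sample`)
and the struct layout are illustrative assumptions, not the plugin's actual code; the real
implementation builds on VPP's `svm/fifo_segment.h` machinery, and only the depth of 4 comes from
the article itself.

```
/* Sketch of a per-worker sample FIFO: single producer (the worker's dataplane
 * node), single consumer (main).  Names and layout are illustrative, not the
 * plugin's actual code. */
#include <stdatomic.h>
#include <stdint.h>

#define SFLOW_FIFO_DEPTH 4        /* deliberately small: overload protection */

typedef struct {
  uint32_t if_index;              /* interface the sample was taken on */
  uint16_t length;                /* number of header bytes copied */
  uint8_t  header[256];           /* sampled packet header */
} sflow_sample_t;

typedef struct {
  sflow_sample_t slot[SFLOW_FIFO_DEPTH];
  _Atomic uint32_t head;          /* advanced by main (consumer) */
  _Atomic uint32_t tail;          /* advanced by the worker (producer) */
  uint64_t drops;                 /* samples dropped because the FIFO was full */
} sflow_worker_fifo_t;

/* Called from the worker's dataplane node: no locks, no RPC to main. */
static inline int
worker_enqueue_sample (sflow_worker_fifo_t *f, const sflow_sample_t *s)
{
  uint32_t tail = atomic_load_explicit (&f->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit (&f->head, memory_order_acquire);
  if (tail - head >= SFLOW_FIFO_DEPTH)
    {
      f->drops++;                 /* full: drop the sample, keep forwarding */
      return -1;
    }
  f->slot[tail % SFLOW_FIFO_DEPTH] = *s;
  atomic_store_explicit (&f->tail, tail + 1, memory_order_release);
  return 0;
}

/* Called from the main-thread process node: drains samples and hands them to
 * whatever writes the PSAMPLE netlink messages. */
static inline int
main_dequeue_sample (sflow_worker_fifo_t *f, sflow_sample_t *out)
{
  uint32_t head = atomic_load_explicit (&f->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit (&f->tail, memory_order_acquire);
  if (head == tail)
    return -1;                    /* nothing queued right now */
  *out = f->slot[head % SFLOW_FIFO_DEPTH];
  atomic_store_explicit (&f->head, head + 1, memory_order_release);
  return 0;
}
```

With a fixed-depth queue like this, an overloaded worker simply bumps its drop counter and moves on,
which is the defense-in-depth behaviour observed in the 1:100 loadtest above, and no interface can
crowd another one out of main's attention.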
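Likewise, the "shoveling" that Neil describes is the common VPP pattern of unrolling the per-packet
loop four wide so the instruction cache stays hot. Below is a condensed, generic sketch of that
pattern, not the actual `plugins/sflow/node.c` function: the `shovel_*` names are made up here, node
registration and tracing are omitted, and the per-packet sampling decision is reduced to a
placeholder comment.

```
/* Condensed sketch of the VPP quad-loop pattern: the node sits between
 * device-input and ethernet-input and mostly just forwards buffers, four at
 * a time when it can.  Illustrative only; not the plugin's actual node. */
#include <vlib/vlib.h>

typedef enum
{
  SHOVEL_NEXT_ETHERNET_INPUT,
  SHOVEL_N_NEXT,
} shovel_next_t;

static uword
shovel_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  u32 *from = vlib_frame_vector_args (frame);
  u32 n_left = frame->n_vectors;
  vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;
  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;

  vlib_get_buffers (vm, from, bufs, n_left);

  /* Quad loop: same work as the single loop below, unrolled four wide. */
  while (n_left >= 8)
    {
      /* Prefetch the next four buffer headers while handling these four. */
      vlib_prefetch_buffer_header (b[4], LOAD);
      vlib_prefetch_buffer_header (b[5], LOAD);
      vlib_prefetch_buffer_header (b[6], LOAD);
      vlib_prefetch_buffer_header (b[7], LOAD);

      /* ... per-packet sampling decision would go here ... */
      next[0] = next[1] = next[2] = next[3] = SHOVEL_NEXT_ETHERNET_INPUT;

      b += 4;
      next += 4;
      n_left -= 4;
    }

  /* Single loop: handles the remainder one packet at a time. */
  while (n_left > 0)
    {
      /* ... per-packet sampling decision would go here ... */
      next[0] = SHOVEL_NEXT_ETHERNET_INPUT;

      b += 1;
      next += 1;
      n_left -= 1;
    }

  vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
  return frame->n_vectors;
}
```

The quad loop and the single loop do exactly the same work; the only difference is how many packets
are touched per iteration, which is where the cycles-per-packet win comes from.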
Neil has introduced vpp-dev@ to this plugin, and so far there were no objections. But he has pointed folks to a github out of tree repo, and I may add a Gerrit instead so it becomes part of the