From 3c69130cea21779d1e7a0e70c7fb0d0924771f06 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sun, 6 Oct 2024 18:07:22 +0200
Subject: [PATCH] Add sflow part 2

---
 content/articles/2024-10-06-sflow-2.md        | 544 ++++++++++++++++++
 static/assets/sflow/hsflowd-demo.png          |   3 +
 static/assets/sflow/trex-acceptance.png       |   3 +
 static/assets/sflow/trex-sflow-acceptance.png |   3 +
 static/assets/sflow/trex-v1.png               |   3 +
 5 files changed, 556 insertions(+)
 create mode 100644 content/articles/2024-10-06-sflow-2.md
 create mode 100644 static/assets/sflow/hsflowd-demo.png
 create mode 100644 static/assets/sflow/trex-acceptance.png
 create mode 100644 static/assets/sflow/trex-sflow-acceptance.png
 create mode 100644 static/assets/sflow/trex-v1.png

diff --git a/content/articles/2024-10-06-sflow-2.md b/content/articles/2024-10-06-sflow-2.md
new file mode 100644
index 0000000..000a4c6
--- /dev/null
+++ b/content/articles/2024-10-06-sflow-2.md
@@ -0,0 +1,544 @@
+---
+date: "2024-10-06T07:51:23Z"
+title: 'VPP with sFlow - Part 2'
+---
+
+# Introduction
+
+{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
+
+Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed switched networks. `sFlow` gives complete visibility into the use of networks, enabling performance optimization, accounting/billing for usage, and defense against security threats.
+
+The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and [[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly reduces the complexity of the code to be implemented in the forwarding path, while at the same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central _Collector_.
+
+Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)] and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on the VPP plugin, playing with a few different approaches to strike a balance between performance, code complexity, and agent features. This article describes our journey.
+
+## VPP: an sFlow plugin
+
+There are three things Neil and I specifically take a look at:
+
+1. If `sFlow` is not enabled on a given interface, there should not be a regression on other interfaces.
+1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as possible, targeting single digit CPU cycles per packet in overhead.
+1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as quickly as possible, targeting double digit CPU cycles per sample.
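+
+To make the second and third goals concrete: the per-packet fast path should boil down to little more than a counter decrement, with the expensive work happening only when a packet is actually selected for sampling. As a rough illustration (this is not the plugin's actual code; `sflow_per_interface_data_t` and `sflow_should_sample()` are made-up names, and a real sFlow agent randomizes the skip count, which I gloss over here):
+
+```
+/* Illustrative sketch of the sampling decision in the worker fast path.
+ * Assumes the usual VPP includes (vppinfra/clib.h for PREDICT_TRUE). */
+typedef struct
+{
+  u32 skip;        /* packets remaining until the next sample is taken */
+  u32 sampling_N;  /* e.g. 100 for 1:100 sampling */
+} sflow_per_interface_data_t;
+
+static_always_inline int
+sflow_should_sample (sflow_per_interface_data_t *ifd)
+{
+  if (PREDICT_TRUE (ifd->skip > 0))
+    {
+      ifd->skip--;              /* the common case: a decrement and a branch */
+      return 0;
+    }
+  ifd->skip = ifd->sampling_N;  /* re-arm; the real agent re-arms with a random skip */
+  return 1;                     /* this packet gets copied out of the dataplane */
+}
+```
+
+Whether each iteration of the plugin lives up to these goals is what the loadtests below measure.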
+
+For all these validation and loadtests, I use a bare metal VPP machine which receives load from a T-Rex loadtester on eight TenGig ports, configured as follows:
+
+**1. RX Queue Placement**
+
+It's important that the network card that receives the traffic gets serviced by a worker thread on the same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I align each NIC with the correct CPU, like so:
+
+```
+set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
+set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
+set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
+set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
+
+set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
+set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
+set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
+set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
+```
+
+**2. L3 IPv4/MPLS interfaces**
+
+I will take two pairs of interfaces, one on NUMA0 and the other on NUMA1, so that I can compare L3 IPv4 or MPLS running without `sFlow` (which I'll call the _baseline_ pair) against a pair running _with_ `sFlow` (which I'll call the _experiment_ pair).
+
+```
+comment { L3: IPv4 interfaces }
+set int state TenGigabitEthernet3/0/0 up
+set int state TenGigabitEthernet3/0/1 up
+set int state TenGigabitEthernet130/0/0 up
+set int state TenGigabitEthernet130/0/1 up
+set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
+set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
+set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
+set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
+ip route add 16.0.0.0/24 via 100.64.0.0
+ip route add 48.0.0.0/24 via 100.64.1.0
+ip route add 16.0.2.0/24 via 100.64.4.0
+ip route add 48.0.2.0/24 via 100.64.5.0
+ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
+ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
+ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
+ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
+```
+
+Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2 adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to in case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0, and it saves VPP from having to do ARP resolution.
+
+The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:
+
+```
+comment { MPLS interfaces }
+mpls table add 0
+set interface mpls TenGigabitEthernet3/0/0 enable
+set interface mpls TenGigabitEthernet3/0/1 enable
+set interface mpls TenGigabitEthernet130/0/0 enable
+set interface mpls TenGigabitEthernet130/0/1 enable
+mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
+mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
+mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
+mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
+```
+
+**3. L2 CrossConnect interfaces**
+
+Here, I will use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig interfaces on NUMA1 as my experiment (`sFlow` enabled) pair.
+This way, I can not only measure the performance impact of enabling `sFlow`, but also assert that no regression occurs in the _baseline_ pair when I enable a feature in the _experiment_ pair, which should really never happen.
+
+```
+comment { L2 xconnected interfaces }
+set int state TenGigabitEthernet3/0/2 up
+set int state TenGigabitEthernet3/0/3 up
+set int state TenGigabitEthernet130/0/2 up
+set int state TenGigabitEthernet130/0/3 up
+set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
+set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
+set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
+set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
+```
+
+**4. T-Rex Configuration**
+
+The Cisco T-Rex loadtester runs on another machine in the same rack. Physically, it has eight ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref 2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch simply has VLANs with two ports each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0, VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight VLANs are used.
+
+The configuration for T-Rex then becomes:
+
+```
+- version: 2
+  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
+  port_info:
+    - src_mac: 00:1b:21:06:00:00
+      dest_mac: 9c:69:b4:61:a1:dc
+    - src_mac: 00:1b:21:06:00:01
+      dest_mac: 9c:69:b4:61:a1:dd
+
+    - src_mac: 00:1b:21:83:00:00
+      dest_mac: 00:1b:21:83:00:01
+    - src_mac: 00:1b:21:83:00:01
+      dest_mac: 00:1b:21:83:00:00
+
+    - src_mac: 00:1b:21:87:00:00
+      dest_mac: 9c:69:b4:61:75:d0
+    - src_mac: 00:1b:21:87:00:01
+      dest_mac: 9c:69:b4:61:75:d1
+
+    - src_mac: 9c:69:b4:85:00:00
+      dest_mac: 9c:69:b4:85:00:01
+    - src_mac: 9c:69:b4:85:00:01
+      dest_mac: 9c:69:b4:85:00:00
+```
+
+Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac` 9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
+
+For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode, which means it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in the L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging connections and looking up FDB entries on the Mellanox switch much, much easier this way.
+
+With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!
+
+{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
+
+The name of the game is now to run a loadtest that shows the packet throughput and CPU cycles spent for each of the plugin iterations, comparing their performance on ports with and without `sFlow` enabled. For each iteration, I will use exactly the same VPP configuration, I will generate unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in _baseline_ and at a somewhat unfavorable 1:100 sampling rate.
+
+Ready? Here I go!
+
+### v1: Workers send RPC to main
+
+**TL/DR**: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in baseline_
+
+The first iteration goes all the way back to a proof of concept from last year. It's described in detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not stellar:
+* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only 5.2Mpps get through. This is considered a mortal sin in VPP-land.
+* ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad.
+* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10), completely destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
+
+Here's the bloodbath as seen from T-Rex:
+
+{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
+
+**Debrief**: When we talked through these issues, we drew the conclusion that it would be much faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the spinlock, the worker simply appends the sample to a producer queue and moves on. This way, no locks are needed, and each worker thread will have its own producer queue.
+
+Then, we can create a separate thread (or even a pool of threads), possibly scheduled on a different CPU (or in main), that runs a loop iterating over all sFlow sample queues, consuming the samples and sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if too many come in.
+
+### v2: Workers send PSAMPLE directly
+
+**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
+
+But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty certain, though: it should be much faster than sending an RPC to the main thread.
+
+After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)], which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When workers send directly, they invoke `sflow_send_sample_from_worker()` instead of sending an RPC with `vl_api_rpc_call_main_thread()` as in the previous version.
+
+The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU counters so we can more accurately count the cumulative time spent in each part of the calls, see [[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show err` instead.
+
+When loadtesting this, the deadly sin of impacting performance of interfaces that do not have `sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of T-Rex, I can also take a look at the throughput as measured by VPP itself.
+In its `show runtime` statistics, each worker thread shows the CPU cycles spent, as well as how many packets/sec it received and transmitted:
+
+```
+pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
+    vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
+pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
+Name          State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow         active     844916    216298496         0  8.69e1        256.00
+sflow         active    1107466    283511296         0  8.26e1        256.00
+pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
+   217929472   sflow   sflow packets processed      error
+     1614519   sflow   sflow packets sampled        error
+  2606893106   sflow   CPU cycles in sent samples   error
+   280697344   sflow   sflow packets processed      error
+     2078203   sflow   sflow packets sampled        error
+  1844674406   sflow   CPU cycles in sent samples   error
+```
+
+At a glance, I can see in the first `grep` the in and out vector (== packet) rates for each worker thread that is doing meaningful work (i.e. has more than 0pps of input). Remember that I pinned the RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0 (as _even_ worker thread numbers are on NUMA domain 0), and worker thread 1 is servicing TenGig130/0/0. What's cool about this is that it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
+
+Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there are 1614519 sampled packets out of 217929472 processed packets (i.e. roughly a 1:100 rate)? I added a CPU clock cycle counter that counts the cumulative clocks spent once samples are taken. I can see that VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample, which is pretty terrible.
+
+**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately improves performance on non-enabled interfaces, and we learned what the cost is of sending those samples. An easy step forward from here is to create a producer/consumer queue, where the workers can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from the queue/ring in another thread that won't block the dataplane.
+
+### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
+
+**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
+
+Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)] that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment called `fifo-sflow-worker`, to which each worker can write its samples in the dataplane node. A new thread called `spt_process_samples` can then call `svm_fifo_dequeue()` on all workers' queues and pump those samples into Netlink.
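+
+Conceptually, that consumer thread just drains each worker's FIFO and forwards whatever it finds to PSAMPLE. Here's a minimal sketch of the idea, and only a sketch: `sflow_sample_t`, `worker_fifos[]`, `n_workers` and `psample_send()` are hypothetical stand-ins, and I'm assuming a dequeue call of roughly the form `svm_fifo_dequeue (fifo, len, dst)` while glossing over the actual FIFO segment setup:
+
+```
+/* Sketch: drain per-worker sample FIFOs outside of the dataplane.
+ * Assumes the usual VPP includes plus <svm/svm_fifo.h> and <unistd.h>. */
+static void *
+spt_process_samples (void *arg)
+{
+  while (1)
+    {
+      for (int i = 0; i < n_workers; i++)
+        {
+          sflow_sample_t sample;
+          /* pull fixed-size records off this worker's FIFO until it is empty */
+          while (svm_fifo_dequeue (worker_fifos[i], sizeof (sample),
+                                   (u8 *) &sample) > 0)
+            psample_send (&sample); /* netlink write happens here, not in a worker */
+        }
+      usleep (1000); /* don't spin when there is nothing to do */
+    }
+  return NULL;
+}
+```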
+
+The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
+Name          State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow         active    1096132    280609792         0  1.63e1        256.00
+sflow         active    1584577    405651712         0  1.46e1        256.00
+pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
+   280635904   sflow   sflow packets processed      error
+     2079194   sflow   sflow packets sampled        error
+   733447310   sflow   CPU cycles in sent samples   error
+   405689856   sflow   sflow packets processed      error
+     3004118   sflow   sflow packets sampled        error
+  1844674407   sflow   CPU cycles in sent samples   error
+```
+
+Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from 86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after a sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles in this version. Also, Netlink writes can no longer hold up the dataplane, because they are now offloaded to a different thread entirely.
+
+**Debrief**: It's not great that we created a new Linux `pthread` for the consumer of the samples. VPP has an elaborate thread management system, with collaborative multitasking in its threading model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so on. I can't help but wonder: wouldn't we be able to just move the `spt_process_samples()` thread into a VPP process node instead?
+
+### v3bis: SVM FIFO, PSAMPLE process in Main
+
+**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
+
+Neil agrees that there's no good reason to keep this out of main, and conjures up [[df2dab8d](https://github.com/sflow/vpp-sflow/commit/df2dab8d)], which rewrites the thread into an `sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way. As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this _process_ wakes up and does some work. It's a widely used pattern in VPP.
+
+Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming rate, which causes lots of `clib_warning()` messages to be spewed on the console. I replace those with a counter of failed Netlink messages instead, in refactor commit [[6ba4715](https://github.com/sflow/vpp-sflow/commit/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
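+
+For reference, the idiomatic shape of such a process node is roughly the following. This is my own sketch, not the code from that commit: the one-millisecond wakeup and the `sflow_drain_worker_fifos()` helper are made up for illustration, while `VLIB_REGISTER_NODE` with `VLIB_NODE_TYPE_PROCESS` and `vlib_process_wait_for_event_or_clock()` are the standard VPP ingredients:
+
+```
+/* Sketch of a process node in main: wake up periodically, drain the
+ * per-worker FIFOs, and write the resulting PSAMPLE netlink messages.
+ * Assumes the usual VPP includes (vlib/vlib.h). */
+static uword
+sflow_process_samples (vlib_main_t *vm, vlib_node_runtime_t *rt,
+                       vlib_frame_t *f)
+{
+  while (1)
+    {
+      vlib_process_wait_for_event_or_clock (vm, 0.001 /* seconds */);
+      sflow_drain_worker_fifos (vm); /* dequeue samples, send them to PSAMPLE */
+    }
+  return 0;
+}
+
+VLIB_REGISTER_NODE (sflow_process_samples_node) = {
+  .function = sflow_process_samples,
+  .type = VLIB_NODE_TYPE_PROCESS,
+  .name = "sflow-process-samples",
+};
+```
+
+Back to the measurements: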
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28052  4.66e4          0.00
+sflow                   active    1134102    290330112         0  1.42e1        256.00
+sflow                   active    1647240    421693440         0  1.32e1        256.00
+pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
+       77945   sflow   sflow PSAMPLE sent           error
+         863   sflow   sflow PSAMPLE send failed    error
+   290376960   sflow   sflow packets processed      error
+     2151184   sflow   sflow packets sampled        error
+   421761024   sflow   sflow packets processed      error
+     3119625   sflow   sflow packets sampled        error
+```
+
+With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using about 4.66e4 CPU cycles per wakeup. Secondly, replacing `clib_warning()` with the `sflow PSAMPLE send failed` counter reduced the time spent in the dataplane from 16.3 to 14.2 cycles on average. Nice.
+
+**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All these send failures and corrupt packets are really messing things up. So while the provided FIFO implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought, and we fear that it may not be safe to read from another thread.
+
+### v4: Custom lockless FIFO, PSAMPLE process in Main
+
+**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
+
+After reading around a bit in DPDK's [[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a commit in [[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)], where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions: `sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I consider black magic, in commit [[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)], which makes use of the `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's `vppinfra/atomics.h`.
+
+What I really like about this change is that it introduces a FIFO implementation in about twenty lines of code, which means the sampling code path in the dataplane becomes really easy to follow, and will be even faster than it was before.
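+
+To give a feel for the shape of this, here is my own illustrative reconstruction, not the code from those commits: `sflow_sample_t` is a made-up record type, the depth matches the per-worker depth discussed below, and I've simplified it to a single producer per FIFO (one FIFO per worker):
+
+```
+/* Sketch of a tiny lock-free ring, one per worker. Assumes the usual VPP
+ * includes (vppinfra/atomics.h for the acquire/release macros). */
+#define SFLOW_FIFO_DEPTH 4 /* per-worker depth, deliberately shallow */
+
+typedef struct
+{
+  sflow_sample_t entries[SFLOW_FIFO_DEPTH];
+  volatile u32 head; /* only advanced by the consumer (main) */
+  volatile u32 tail; /* only advanced by the producer (worker) */
+} sflow_fifo_t;
+
+/* Worker side: returns 0 (and the caller drops the sample) when full. */
+static_always_inline int
+sflow_fifo_enqueue (sflow_fifo_t *f, const sflow_sample_t *s)
+{
+  u32 head = clib_atomic_load_acq_n (&f->head);
+  u32 tail = f->tail;
+  if (tail - head >= SFLOW_FIFO_DEPTH)
+    return 0;                                   /* full: drop, count it */
+  f->entries[tail % SFLOW_FIFO_DEPTH] = *s;
+  clib_atomic_store_rel_n (&f->tail, tail + 1); /* publish after the copy */
+  return 1;
+}
+
+/* Main side: returns 0 when there is nothing to dequeue. */
+static_always_inline int
+sflow_fifo_dequeue (sflow_fifo_t *f, sflow_sample_t *s)
+{
+  u32 tail = clib_atomic_load_acq_n (&f->tail);
+  u32 head = f->head;
+  if (head == tail)
+    return 0;                                   /* empty */
+  *s = f->entries[head % SFLOW_FIFO_DEPTH];
+  clib_atomic_store_rel_n (&f->head, head + 1); /* free the slot */
+  return 1;
+}
+```
+
+The acquire/release pairing is what makes this safe without a spinlock: the consumer only sees a new `tail` after the sample copy that preceded it, and the producer only sees a new `head` after the consumer is done reading the slot.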
+
+I take it out for a loadtest:
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     17767  1.52e6          0.00
+sflow                   active    1121156    287015936         0  1.56e1        256.00
+sflow                   active    1605772    411077632         0  1.53e1        256.00
+pim@hvn6-lab:~$ grep sflow v4-100-err.txt
+     3553600   sflow   sflow PSAMPLE sent           error
+   287101184   sflow   sflow packets processed      error
+     2127024   sflow   sflow packets sampled        error
+      350224   sflow   sflow packets dropped        error
+   411199744   sflow   sflow packets processed      error
+     3043693   sflow   sflow packets sampled        error
+     1266893   sflow   sflow packets dropped        error
+```
+
+This is starting to be a very nice implementation! With this iteration of the plugin, all the corruption is gone, and there is only a slight regression (because we're now actually _sending_ the messages). With the v3bis variant, only a tiny fraction of the samples made it through to Netlink. With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken, 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
+
+Doing the math, each worker can enqueue 1776800 samples in 30 seconds, which is 59k/s per interface. I can also see that the second interface, which is doing L2XC and therefore a much higher packet rate, drops more samples, because main spends an equal amount of time reading from each worker's queue. In other words: in an overload scenario, one interface cannot crowd out another. Slick.
+
+Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the `sflow PSAMPLE send failed` counter remains zero.
+
+{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
+
+**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up these Netlink messages. There's also a bit of work to do with retrieving the packet and byte counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress. I'll discuss the end-to-end changes with `hsflowd` in a follow-up article, and focus my efforts here on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated `sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes, which integrate the rest of the business logic outside of the VPP dataplane, where it's arguably expensive to make mistakes.
+
+Neil reveals an itch that he has been meaning to scratch all this time. In VPP's `plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most of the time the plugin is really just shoveling the ethernet packets through to `ethernet-input`.
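+
+Conceptually, that shoveling loop is nothing more than the following. This is a deliberately simplified one-packet-at-a-time sketch, with a hypothetical `SFLOW_NEXT_ETHERNET_INPUT` next index; the real node in `plugins/sflow/node.c` is more elaborate:
+
+```
+/* Sketch of the passthrough loop: every packet goes on to ethernet-input,
+ * and a packet is only copied out when the sampler selects it.
+ * Assumes the usual VPP includes (vlib/vlib.h, vlib/buffer_funcs.h). */
+static uword
+sflow_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
+{
+  u32 *from = vlib_frame_vector_args (frame);
+  u32 n_left = frame->n_vectors;
+  vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;
+  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;
+
+  vlib_get_buffers (vm, from, bufs, n_left);
+
+  while (n_left > 0)
+    {
+      /* this is where b[0] would be considered for sampling */
+      next[0] = SFLOW_NEXT_ETHERNET_INPUT;
+      b += 1;
+      next += 1;
+      n_left -= 1;
+    }
+
+  vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
+  return frame->n_vectors;
+}
+```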
+
+To make use of some CPU instruction cache affinity, the loop that does this shoveling can handle one packet at a time, two packets at a time, or even four packets at a time. Although the code is super repetitive and somewhat ugly, it does actually speed up processing, in terms of CPU cycles spent per packet, if you shovel four of them at a time.
+
+### v5: Quad Bucket Brigade in worker
+
+**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
+
+Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default 2-packet to a 4-packet shovel. In commit [[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up to eight packets (`p0`-`p7`) if the vector is long enough, and handle them four at a time (`b0`-`b3`). He also adds a few compiler hints for branch prediction: almost no packets will have a trace enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
+
+Reading the dataplane code, I find it incredibly ugly, but that's the price to pay for ultra fast throughput. How do we see the effect, though? My low-tech proposal is to enable sampling at a very sparse rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO is almost never called. Then, what's left for the `sFlow` dataplane node really is just to shovel the packets from `device-input` into `ethernet-input`.
+
+To measure the relative improvement, I do one test with, and one without, commit [[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28467  9.36e3          0.00
+sflow                   active    1158325    296531200         0  1.09e1        256.00
+sflow                   active    1679742    430013952         0  1.11e1        256.00
+
+pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
+  vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28462  9.57e3          0.00
+sflow                   active    1137571    291218176         0  1.26e1        256.00
+sflow                   active    1641991    420349696         0  1.20e1        256.00
+```
+
+Would you look at that, this optimization actually works as advertised! There is a meaningful _progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput. **Quad-Bucket-Brigade, yaay!**
+
+I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100 packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate.
+You'll recall that at the beginning of this article I did an acceptance loadtest with sFlow disabled; this is the exact same result **with sFlow** enabled:
+
+{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
+
+This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k samples/sec taken from the dataplane, with correct rate limiting due to the per-worker FIFO depth limit, yielding 25k samples/sec sent to Netlink.
+
+## What's Next
+
+Checking in on the three main things we wanted to ensure with the plugin:
+
+1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
+1. ✅ If `sFlow` _is_ enabled, passing packets through costs 11 CPU cycles on average.
+1. ✅ If `sFlow` samples, it takes only marginally more CPU time to enqueue. No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces this to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput. An overly harsh 1:100 reduces it to only 9.69Mpps and 13.97Mpps.
+
+The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet and byte counters, to send this information along with possible Linux CP data (such as the TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about that part in a follow-up article.
+
+Neil has introduced this plugin on vpp-dev@, and so far there have been no objections. But he has pointed folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)], which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, and add some VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable shape.
+
+### Acknowledgements
+
+I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the finer details such as logging, error handling, API specifications, and documentation. He has been a true pleasure to work with and learn from.
diff --git a/static/assets/sflow/hsflowd-demo.png b/static/assets/sflow/hsflowd-demo.png
new file mode 100644
index 0000000..6788e9e
--- /dev/null
+++ b/static/assets/sflow/hsflowd-demo.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:07bc2be08e8682a7e630abae448b8f44cd1a8a97351702d3ffc982b16fc49987
+size 1036446
diff --git a/static/assets/sflow/trex-acceptance.png b/static/assets/sflow/trex-acceptance.png
new file mode 100644
index 0000000..bbd33ac
--- /dev/null
+++ b/static/assets/sflow/trex-acceptance.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4e5880a58d5d021e789ae80b5ecbe3e5afee0738baa89072e521d494d6497c77
+size 281910
diff --git a/static/assets/sflow/trex-sflow-acceptance.png b/static/assets/sflow/trex-sflow-acceptance.png
new file mode 100644
index 0000000..40a8b73
--- /dev/null
+++ b/static/assets/sflow/trex-sflow-acceptance.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1ed1abbcdd77eee6260db4b0bf22e5a961990ee6db91bcdb72a68cb3ade4b131
+size 297538
diff --git a/static/assets/sflow/trex-v1.png b/static/assets/sflow/trex-v1.png
new file mode 100644
index 0000000..0a85b54
--- /dev/null
+++ b/static/assets/sflow/trex-v1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b893e4c8405436fc1bb2268474875b4879ce9ef03c317ae8414324343bd060fa
+size 718915