From 3c69130cea21779d1e7a0e70c7fb0d0924771f06 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sun, 6 Oct 2024 18:07:22 +0200
Subject: [PATCH] Add sflow part 2

---
 content/articles/2024-10-06-sflow-2.md        | 544 ++++++++++++++++++
 static/assets/sflow/hsflowd-demo.png          |   3 +
 static/assets/sflow/trex-acceptance.png       |   3 +
 static/assets/sflow/trex-sflow-acceptance.png |   3 +
 static/assets/sflow/trex-v1.png               |   3 +
 5 files changed, 556 insertions(+)
 create mode 100644 content/articles/2024-10-06-sflow-2.md
 create mode 100644 static/assets/sflow/hsflowd-demo.png
 create mode 100644 static/assets/sflow/trex-acceptance.png
 create mode 100644 static/assets/sflow/trex-sflow-acceptance.png
 create mode 100644 static/assets/sflow/trex-v1.png

diff --git a/content/articles/2024-10-06-sflow-2.md b/content/articles/2024-10-06-sflow-2.md
new file mode 100644
index 0000000..000a4c6
--- /dev/null
+++ b/content/articles/2024-10-06-sflow-2.md
@@ -0,0 +1,544 @@
+---
+date: "2024-10-06T07:51:23Z"
+title: 'VPP with sFlow - Part 2'
+---
+
+# Introduction
+
+{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
+
+Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed switched networks. `sFlow` gives complete visibility into the use of networks, enabling performance optimization, accounting/billing for usage, and defense against security threats.
+
+The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and [[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly reduces the complexity of the code to be implemented in the forwarding path, while at the same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central _Collector_.
+
+Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)] and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on the VPP plugin, playing with a few different approaches to strike a balance between performance, code complexity, and agent features. This article describes our journey.
+
+## VPP: an sFlow plugin
+
+There are three things Neil and I specifically take a look at:
+
+1. If `sFlow` is not enabled on a given interface, there should not be a regression on other interfaces.
+1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as possible, targeting single digit CPU cycles per packet in overhead.
+1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as quickly as possible, targeting double digit CPU cycles per sample.
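+
+To make the second and third goals concrete: the per-packet fast path should boil down to little more than a counter decrement, with the expensive work happening only when a packet is actually selected for sampling. As a rough illustration (this is not the plugin's actual code; `sflow_per_interface_data_t` and `sflow_should_sample()` are made-up names, and a real sFlow agent randomizes the skip count, which I gloss over here):
+
+```
+/* Illustrative sketch of the sampling decision in the worker fast path.
+ * Assumes the usual VPP includes (vppinfra/clib.h for PREDICT_TRUE). */
+typedef struct
+{
+  u32 skip;        /* packets remaining until the next sample is taken */
+  u32 sampling_N;  /* e.g. 100 for 1:100 sampling */
+} sflow_per_interface_data_t;
+
+static_always_inline int
+sflow_should_sample (sflow_per_interface_data_t *ifd)
+{
+  if (PREDICT_TRUE (ifd->skip > 0))
+    {
+      ifd->skip--;              /* the common case: a decrement and a branch */
+      return 0;
+    }
+  ifd->skip = ifd->sampling_N;  /* re-arm; the real agent re-arms with a random skip */
+  return 1;                     /* this packet gets copied out of the dataplane */
+}
+```
+
+Whether each iteration of the plugin lives up to these goals is what the loadtests below measure.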
+
+For all these validation and loadtests, I use a bare metal VPP machine which receives load from a T-Rex loadtester on eight TenGig ports, configured as follows:
+
+**1. RX Queue Placement**
+
+It's important that the network card that receives the traffic gets serviced by a worker thread on the same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I align each NIC with the correct CPU, like so:
+
+```
+set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
+set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
+set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
+set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
+
+set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
+set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
+set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
+set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
+```
+
+**2. L3 IPv4/MPLS interfaces**
+
+I will take two pairs of interfaces, one on NUMA0 and the other on NUMA1, so that I can compare L3 IPv4 or MPLS running without `sFlow` (which I'll call the _baseline_ pair) against a pair running _with_ `sFlow` (which I'll call the _experiment_ pair).
+
+```
+comment { L3: IPv4 interfaces }
+set int state TenGigabitEthernet3/0/0 up
+set int state TenGigabitEthernet3/0/1 up
+set int state TenGigabitEthernet130/0/0 up
+set int state TenGigabitEthernet130/0/1 up
+set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
+set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
+set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
+set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
+ip route add 16.0.0.0/24 via 100.64.0.0
+ip route add 48.0.0.0/24 via 100.64.1.0
+ip route add 16.0.2.0/24 via 100.64.4.0
+ip route add 48.0.2.0/24 via 100.64.5.0
+ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
+ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
+ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
+ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
+```
+
+Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2 adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to in case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0, and it saves VPP from having to do ARP resolution.
+
+The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:
+
+```
+comment { MPLS interfaces }
+mpls table add 0
+set interface mpls TenGigabitEthernet3/0/0 enable
+set interface mpls TenGigabitEthernet3/0/1 enable
+set interface mpls TenGigabitEthernet130/0/0 enable
+set interface mpls TenGigabitEthernet130/0/1 enable
+mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
+mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
+mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
+mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
+```
+
+**3. L2 CrossConnect interfaces**
+
+Here, I will use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig interfaces on NUMA1 as my experiment (`sFlow` enabled) pair.
+This way, I can not only measure the performance impact of enabling `sFlow`, but also assert that no regression occurs in the _baseline_ pair when I enable a feature in the _experiment_ pair, which should really never happen.
+
+```
+comment { L2 xconnected interfaces }
+set int state TenGigabitEthernet3/0/2 up
+set int state TenGigabitEthernet3/0/3 up
+set int state TenGigabitEthernet130/0/2 up
+set int state TenGigabitEthernet130/0/3 up
+set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
+set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
+set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
+set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
+```
+
+**4. T-Rex Configuration**
+
+The Cisco T-Rex loadtester runs on another machine in the same rack. Physically, it has eight ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref 2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch simply has VLANs with two ports each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0, VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight VLANs are used.
+
+The configuration for T-Rex then becomes:
+
+```
+- version: 2
+  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
+  port_info:
+    - src_mac: 00:1b:21:06:00:00
+      dest_mac: 9c:69:b4:61:a1:dc
+    - src_mac: 00:1b:21:06:00:01
+      dest_mac: 9c:69:b4:61:a1:dd
+
+    - src_mac: 00:1b:21:83:00:00
+      dest_mac: 00:1b:21:83:00:01
+    - src_mac: 00:1b:21:83:00:01
+      dest_mac: 00:1b:21:83:00:00
+
+    - src_mac: 00:1b:21:87:00:00
+      dest_mac: 9c:69:b4:61:75:d0
+    - src_mac: 00:1b:21:87:00:01
+      dest_mac: 9c:69:b4:61:75:d1
+
+    - src_mac: 9c:69:b4:85:00:00
+      dest_mac: 9c:69:b4:85:00:01
+    - src_mac: 9c:69:b4:85:00:01
+      dest_mac: 9c:69:b4:85:00:00
+```
+
+Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac` 9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
+
+For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode, which means it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in the L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging connections and looking up FDB entries on the Mellanox switch much, much easier this way.
+
+With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!
+
+{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
+
+The name of the game is now to run a loadtest that shows the packet throughput and CPU cycles spent for each of the plugin iterations, comparing their performance on ports with and without `sFlow` enabled. For each iteration, I will use exactly the same VPP configuration, I will generate unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in _baseline_ and at a somewhat unfavorable 1:100 sampling rate.
+
+Ready? Here I go!
+
+### v1: Workers send RPC to main
+
+**TL/DR**: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in baseline_
+
+The first iteration goes all the way back to a proof of concept from last year. It's described in detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not stellar:
+* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only 5.2Mpps get through. This is considered a mortal sin in VPP-land.
+* ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad.
+* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10), completely destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
+
+Here's the bloodbath as seen from T-Rex:
+
+{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
+
+**Debrief**: When we talked through these issues, we drew the conclusion that it would be much faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the spinlock, the worker simply appends the sample to a producer queue and moves on. This way, no locks are needed, and each worker thread will have its own producer queue.
+
+Then, we can create a separate thread (or even a pool of threads), possibly scheduled on a different CPU (or in main), that runs a loop iterating over all sFlow sample queues, consuming the samples and sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if too many come in.
+
+### v2: Workers send PSAMPLE directly
+
+**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
+
+But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty certain, though: it should be much faster than sending an RPC to the main thread.
+
+After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)], which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When workers send directly, they invoke `sflow_send_sample_from_worker()` instead of sending an RPC with `vl_api_rpc_call_main_thread()` as in the previous version.
+
+The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU counters so we can more accurately count the cumulative time spent in each part of the calls, see [[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show err` instead.
+
+When loadtesting this, the deadly sin of impacting performance of interfaces that do not have `sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of T-Rex, I can also take a look at the throughput as measured by VPP itself.
+In its `show runtime` statistics, each worker thread shows the CPU cycles spent, as well as how many packets/sec it received and transmitted:
+
+```
+pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
+    vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
+pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
+Name          State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow         active     844916    216298496         0  8.69e1        256.00
+sflow         active    1107466    283511296         0  8.26e1        256.00
+pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
+   217929472   sflow   sflow packets processed      error
+     1614519   sflow   sflow packets sampled        error
+  2606893106   sflow   CPU cycles in sent samples   error
+   280697344   sflow   sflow packets processed      error
+     2078203   sflow   sflow packets sampled        error
+  1844674406   sflow   CPU cycles in sent samples   error
+```
+
+At a glance, I can see in the first `grep` the in and out vector (== packet) rates for each worker thread that is doing meaningful work (i.e. has more than 0pps of input). Remember that I pinned the RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0 (as _even_ worker thread numbers are on NUMA domain 0), and worker thread 1 is servicing TenGig130/0/0. What's cool about this is that it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
+
+Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there are 1614519 sampled packets out of 217929472 processed packets (i.e. roughly a 1:100 rate)? I added a CPU clock cycle counter that counts the cumulative clocks spent once samples are taken. I can see that VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample, which is pretty terrible.
+
+**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately improves performance on non-enabled interfaces, and we learned what the cost is of sending those samples. An easy step forward from here is to create a producer/consumer queue, where the workers can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from the queue/ring in another thread that won't block the dataplane.
+
+### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
+
+**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
+
+Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)] that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment called `fifo-sflow-worker`, to which each worker can write its samples in the dataplane node. A new thread called `spt_process_samples` can then call `svm_fifo_dequeue()` on all workers' queues and pump those samples into Netlink.
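+
+Conceptually, that consumer thread just drains each worker's FIFO and forwards whatever it finds to PSAMPLE. Here's a minimal sketch of the idea, and only a sketch: `sflow_sample_t`, `worker_fifos[]`, `n_workers` and `psample_send()` are hypothetical stand-ins, and I'm assuming a dequeue call of roughly the form `svm_fifo_dequeue (fifo, len, dst)` while glossing over the actual FIFO segment setup:
+
+```
+/* Sketch: drain per-worker sample FIFOs outside of the dataplane.
+ * Assumes the usual VPP includes plus <svm/svm_fifo.h> and <unistd.h>. */
+static void *
+spt_process_samples (void *arg)
+{
+  while (1)
+    {
+      for (int i = 0; i < n_workers; i++)
+        {
+          sflow_sample_t sample;
+          /* pull fixed-size records off this worker's FIFO until it is empty */
+          while (svm_fifo_dequeue (worker_fifos[i], sizeof (sample),
+                                   (u8 *) &sample) > 0)
+            psample_send (&sample); /* netlink write happens here, not in a worker */
+        }
+      usleep (1000); /* don't spin when there is nothing to do */
+    }
+  return NULL;
+}
+```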
+
+The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
+Name          State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow         active    1096132    280609792         0  1.63e1        256.00
+sflow         active    1584577    405651712         0  1.46e1        256.00
+pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
+   280635904   sflow   sflow packets processed      error
+     2079194   sflow   sflow packets sampled        error
+   733447310   sflow   CPU cycles in sent samples   error
+   405689856   sflow   sflow packets processed      error
+     3004118   sflow   sflow packets sampled        error
+  1844674407   sflow   CPU cycles in sent samples   error
+```
+
+Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from 86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after a sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles in this version. Also, Netlink writes can no longer hold up the dataplane, because they are now offloaded to a different thread entirely.
+
+**Debrief**: It's not great that we created a new Linux `pthread` for the consumer of the samples. VPP has an elaborate thread management system, with collaborative multitasking in its threading model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so on. I can't help but wonder: wouldn't we be able to just move the `spt_process_samples()` thread into a VPP process node instead?
+
+### v3bis: SVM FIFO, PSAMPLE process in Main
+
+**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
+
+Neil agrees that there's no good reason to keep this out of main, and conjures up [[df2dab8d](https://github.com/sflow/vpp-sflow/commit/df2dab8d)], which rewrites the thread into an `sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way. As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this _process_ wakes up and does some work. It's a widely used pattern in VPP.
+
+Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming rate, which causes lots of `clib_warning()` messages to be spewed on the console. I replace those with a counter of failed Netlink messages instead, in refactor commit [[6ba4715](https://github.com/sflow/vpp-sflow/commit/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
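+
+For reference, the idiomatic shape of such a process node is roughly the following. This is my own sketch, not the code from that commit: the one-millisecond wakeup and the `sflow_drain_worker_fifos()` helper are made up for illustration, while `VLIB_REGISTER_NODE` with `VLIB_NODE_TYPE_PROCESS` and `vlib_process_wait_for_event_or_clock()` are the standard VPP ingredients:
+
+```
+/* Sketch of a process node in main: wake up periodically, drain the
+ * per-worker FIFOs, and write the resulting PSAMPLE netlink messages.
+ * Assumes the usual VPP includes (vlib/vlib.h). */
+static uword
+sflow_process_samples (vlib_main_t *vm, vlib_node_runtime_t *rt,
+                       vlib_frame_t *f)
+{
+  while (1)
+    {
+      vlib_process_wait_for_event_or_clock (vm, 0.001 /* seconds */);
+      sflow_drain_worker_fifos (vm); /* dequeue samples, send them to PSAMPLE */
+    }
+  return 0;
+}
+
+VLIB_REGISTER_NODE (sflow_process_samples_node) = {
+  .function = sflow_process_samples,
+  .type = VLIB_NODE_TYPE_PROCESS,
+  .name = "sflow-process-samples",
+};
+```
+
+Back to the measurements: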
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28052  4.66e4          0.00
+sflow                   active    1134102    290330112         0  1.42e1        256.00
+sflow                   active    1647240    421693440         0  1.32e1        256.00
+pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
+       77945   sflow   sflow PSAMPLE sent           error
+         863   sflow   sflow PSAMPLE send failed    error
+   290376960   sflow   sflow packets processed      error
+     2151184   sflow   sflow packets sampled        error
+   421761024   sflow   sflow packets processed      error
+     3119625   sflow   sflow packets sampled        error
+```
+
+With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using about 4.66e4 CPU cycles per wakeup. Secondly, replacing `clib_warning()` with the `sflow PSAMPLE send failed` counter reduced the time spent in the dataplane from 16.3 to 14.2 cycles on average. Nice.
+
+**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All these send failures and corrupt packets are really messing things up. So while the provided FIFO implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought, and we fear that it may not be safe to read from another thread.
+
+### v4: Custom lockless FIFO, PSAMPLE process in Main
+
+**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
+
+After reading around a bit in DPDK's [[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a commit in [[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)], where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions: `sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I consider black magic, in commit [[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)], which makes use of the `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's `vppinfra/atomics.h`.
+
+What I really like about this change is that it introduces a FIFO implementation in about twenty lines of code, which means the sampling code path in the dataplane becomes really easy to follow, and will be even faster than it was before.
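+
+To give a feel for the shape of this, here is my own illustrative reconstruction, not the code from those commits: `sflow_sample_t` is a made-up record type, the depth matches the per-worker depth discussed below, and I've simplified it to a single producer per FIFO (one FIFO per worker):
+
+```
+/* Sketch of a tiny lock-free ring, one per worker. Assumes the usual VPP
+ * includes (vppinfra/atomics.h for the acquire/release macros). */
+#define SFLOW_FIFO_DEPTH 4 /* per-worker depth, deliberately shallow */
+
+typedef struct
+{
+  sflow_sample_t entries[SFLOW_FIFO_DEPTH];
+  volatile u32 head; /* only advanced by the consumer (main) */
+  volatile u32 tail; /* only advanced by the producer (worker) */
+} sflow_fifo_t;
+
+/* Worker side: returns 0 (and the caller drops the sample) when full. */
+static_always_inline int
+sflow_fifo_enqueue (sflow_fifo_t *f, const sflow_sample_t *s)
+{
+  u32 head = clib_atomic_load_acq_n (&f->head);
+  u32 tail = f->tail;
+  if (tail - head >= SFLOW_FIFO_DEPTH)
+    return 0;                                   /* full: drop, count it */
+  f->entries[tail % SFLOW_FIFO_DEPTH] = *s;
+  clib_atomic_store_rel_n (&f->tail, tail + 1); /* publish after the copy */
+  return 1;
+}
+
+/* Main side: returns 0 when there is nothing to dequeue. */
+static_always_inline int
+sflow_fifo_dequeue (sflow_fifo_t *f, sflow_sample_t *s)
+{
+  u32 tail = clib_atomic_load_acq_n (&f->tail);
+  u32 head = f->head;
+  if (head == tail)
+    return 0;                                   /* empty */
+  *s = f->entries[head % SFLOW_FIFO_DEPTH];
+  clib_atomic_store_rel_n (&f->head, head + 1); /* free the slot */
+  return 1;
+}
+```
+
+The acquire/release pairing is what makes this safe without a spinlock: the consumer only sees a new `tail` after the sample copy that preceded it, and the producer only sees a new `head` after the consumer is done reading the slot.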
+
+I take it out for a loadtest:
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     17767  1.52e6          0.00
+sflow                   active    1121156    287015936         0  1.56e1        256.00
+sflow                   active    1605772    411077632         0  1.53e1        256.00
+pim@hvn6-lab:~$ grep sflow v4-100-err.txt
+     3553600   sflow   sflow PSAMPLE sent           error
+   287101184   sflow   sflow packets processed      error
+     2127024   sflow   sflow packets sampled        error
+      350224   sflow   sflow packets dropped        error
+   411199744   sflow   sflow packets processed      error
+     3043693   sflow   sflow packets sampled        error
+     1266893   sflow   sflow packets dropped        error
+```
+
+This is starting to be a very nice implementation! With this iteration of the plugin, all the corruption is gone, and there is only a slight regression (because we're now actually _sending_ the messages). With the v3bis variant, only a tiny fraction of the samples made it through to Netlink. With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken, 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
+
+Doing the math, each worker can enqueue 1776800 samples in 30 seconds, which is 59k/s per interface. I can also see that the second interface, which is doing L2XC and therefore a much higher packet rate, drops more samples, because main spends an equal amount of time reading from each worker's queue. In other words: in an overload scenario, one interface cannot crowd out another. Slick.
+
+Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the `sflow PSAMPLE send failed` counter remains zero.
+
+{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
+
+**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up these Netlink messages. There's also a bit of work to do with retrieving the packet and byte counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress. I'll discuss the end-to-end changes with `hsflowd` in a follow-up article, and focus my efforts here on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated `sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes, which integrate the rest of the business logic outside of the VPP dataplane, where it's arguably expensive to make mistakes.
+
+Neil reveals an itch that he has been meaning to scratch all this time. In VPP's `plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most of the time the plugin is really just shoveling the ethernet packets through to `ethernet-input`.
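+
+Conceptually, that shoveling loop is nothing more than the following. This is a deliberately simplified one-packet-at-a-time sketch, with a hypothetical `SFLOW_NEXT_ETHERNET_INPUT` next index; the real node in `plugins/sflow/node.c` is more elaborate:
+
+```
+/* Sketch of the passthrough loop: every packet goes on to ethernet-input,
+ * and a packet is only copied out when the sampler selects it.
+ * Assumes the usual VPP includes (vlib/vlib.h, vlib/buffer_funcs.h). */
+static uword
+sflow_node_fn (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
+{
+  u32 *from = vlib_frame_vector_args (frame);
+  u32 n_left = frame->n_vectors;
+  vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;
+  u16 nexts[VLIB_FRAME_SIZE], *next = nexts;
+
+  vlib_get_buffers (vm, from, bufs, n_left);
+
+  while (n_left > 0)
+    {
+      /* this is where b[0] would be considered for sampling */
+      next[0] = SFLOW_NEXT_ETHERNET_INPUT;
+      b += 1;
+      next += 1;
+      n_left -= 1;
+    }
+
+  vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
+  return frame->n_vectors;
+}
+```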
+
+To make use of some CPU instruction cache affinity, the loop that does this shoveling can handle one packet at a time, two packets at a time, or even four packets at a time. Although the code is super repetitive and somewhat ugly, it does actually speed up processing, in terms of CPU cycles spent per packet, if you shovel four of them at a time.
+
+### v5: Quad Bucket Brigade in worker
+
+**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
+
+Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default 2-packet to a 4-packet shovel. In commit [[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up to eight packets (`p0`-`p7`) if the vector is long enough, and handle them four at a time (`b0`-`b3`). He also adds a few compiler hints for branch prediction: almost no packets will have a trace enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
+
+Reading the dataplane code, I find it incredibly ugly, but that's the price to pay for ultra fast throughput. How do we see the effect, though? My low-tech proposal is to enable sampling at a very sparse rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO is almost never called. Then, what's left for the `sFlow` dataplane node really is just to shovel the packets from `device-input` into `ethernet-input`.
+
+To measure the relative improvement, I do one test with, and one without, commit [[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
+
+```
+pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
+  vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28467  9.36e3          0.00
+sflow                   active    1158325    296531200         0  1.09e1        256.00
+sflow                   active    1679742    430013952         0  1.11e1        256.00
+
+pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
+  vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
+  vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
+pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
+Name                    State       Calls      Vectors  Suspends  Clocks  Vectors/Call
+sflow-process-samples   any wait        0            0     28462  9.57e3          0.00
+sflow                   active    1137571    291218176         0  1.26e1        256.00
+sflow                   active    1641991    420349696         0  1.20e1        256.00
+```
+
+Would you look at that, this optimization actually works as advertised! There is a meaningful _progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput. **Quad-Bucket-Brigade, yaay!**
+
+I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100 packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate.
+You'll recall that at the beginning of this article I did an acceptance loadtest with sFlow disabled; this is the exact same result **with sFlow** enabled:
+
+{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
+
+This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k samples/sec taken from the dataplane, with correct rate limiting due to the per-worker FIFO depth limit, yielding 25k samples/sec sent to Netlink.
+
+## What's Next
+
+Checking in on the three main things we wanted to ensure with the plugin:
+
+1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
+1. ✅ If `sFlow` _is_ enabled, passing packets through costs 11 CPU cycles on average.
+1. ✅ If `sFlow` samples, it takes only marginally more CPU time to enqueue. No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces this to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput. An overly harsh 1:100 reduces it to only 9.69Mpps and 13.97Mpps.
+
+The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet and byte counters, to send this information along with possible Linux CP data (such as the TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about that part in a follow-up article.
+
+Neil has introduced this plugin on vpp-dev@, and so far there have been no objections. But he has pointed folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)], which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, and add some VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable shape.
+
+### Acknowledgements
+
+I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the finer details such as logging, error handling, API specifications, and documentation. He has been a true pleasure to work with and learn from.
diff --git a/static/assets/sflow/hsflowd-demo.png b/static/assets/sflow/hsflowd-demo.png
new file mode 100644
index 0000000..6788e9e
--- /dev/null
+++ b/static/assets/sflow/hsflowd-demo.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:07bc2be08e8682a7e630abae448b8f44cd1a8a97351702d3ffc982b16fc49987
+size 1036446
diff --git a/static/assets/sflow/trex-acceptance.png b/static/assets/sflow/trex-acceptance.png
new file mode 100644
index 0000000..bbd33ac
--- /dev/null
+++ b/static/assets/sflow/trex-acceptance.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4e5880a58d5d021e789ae80b5ecbe3e5afee0738baa89072e521d494d6497c77
+size 281910
diff --git a/static/assets/sflow/trex-sflow-acceptance.png b/static/assets/sflow/trex-sflow-acceptance.png
new file mode 100644
index 0000000..40a8b73
--- /dev/null
+++ b/static/assets/sflow/trex-sflow-acceptance.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1ed1abbcdd77eee6260db4b0bf22e5a961990ee6db91bcdb72a68cb3ade4b131
+size 297538
diff --git a/static/assets/sflow/trex-v1.png b/static/assets/sflow/trex-v1.png
new file mode 100644
index 0000000..0a85b54
--- /dev/null
+++ b/static/assets/sflow/trex-v1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b893e4c8405436fc1bb2268474875b4879ce9ef03c317ae8414324343bd060fa
+size 718915