---
date: "2024-10-06T07:51:23Z"
title: 'VPP with sFlow - Part 2'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}

Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed
switched networks. `sFlow` gives complete visibility into the use of networks, enabling performance
optimization, accounting/billing for usage, and defense against security threats.

The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for a small
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
reduces the complexity of the code to be implemented in the forwarding path, while at the same time
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
_Collector_.

Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
complexity, and agent features. This article describes our journey.

## VPP: an sFlow plugin

There are three things Neil and I specifically want to take a look at:

1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
   interfaces.
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
   possible, targeting single digit CPU cycles per packet (a sketch of how agents typically achieve
   this follows the list).
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
   quickly as possible, targeting double digit CPU cycles per sample.
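
How can the unsampled path stay in the single digits? A common trick in sFlow agents is a randomized
countdown: each worker keeps a `skip` counter drawn from a distribution with mean N, and the
per-packet work is one decrement and one branch. A minimal sketch of the general technique (my
illustration, not necessarily this plugin's exact logic):

```
/* Countdown-based 1:N sampling (illustrative). The fast path is a single
 * decrement-and-test; the random reseed only runs once per sample.
 * Note: skip must be initialized with next_skip() before first use. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint32_t skip;        /* packets left until the next sample */
  uint32_t sampling_N;  /* e.g. 100 for 1:100 sampling */
} sampler_t;

/* A uniform skip in [1, 2N-1] has mean N, so long-run we sample 1-in-N. */
static inline uint32_t next_skip (uint32_t N)
{
  return 1 + (random () % (2 * N - 1));
}

static inline int take_sample (sampler_t *s)
{
  if (--s->skip > 0)    /* overwhelmingly common case */
    return 0;
  s->skip = next_skip (s->sampling_N);
  return 1;             /* this packet gets sampled */
}
```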

For all of this validation and loadtesting, I use a bare metal VPP machine which receives load from
a T-Rex loadtester on eight TenGig ports, configured as follows:

**1. RX Queue Placement**

It's important that the network card receiving the traffic gets serviced by a worker thread on the
same NUMA domain. Since my machine has two CPUs (and thus, two NUMA nodes), I align each NIC with
the correct CPU, like so:

```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6

set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
```

**2. L3 IPv4/MPLS interfaces**

I take two pairs of interfaces, one on NUMA0 and the other on NUMA1, so that I can compare L3 IPv4
or MPLS running without `sFlow` (which I'll call the _baseline_ pair) against a pair running _with_
`sFlow` (which I'll call the _experiment_ pair).

```
comment { L3: IPv4 interfaces }
set int state TenGigabitEthernet3/0/0 up
set int state TenGigabitEthernet3/0/1 up
set int state TenGigabitEthernet130/0/0 up
set int state TenGigabitEthernet130/0/1 up
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
ip route add 16.0.0.0/24 via 100.64.0.0
ip route add 48.0.0.0/24 via 100.64.1.0
ip route add 16.0.2.0/24 via 100.64.4.0
ip route add 48.0.2.0/24 via 100.64.5.0
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to when a
packet has to be forwarded to 100.64.0.0 or 100.64.5.0, and it saves VPP from having to perform ARP
resolution.

The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:

```
comment { MPLS interfaces }
mpls table add 0
set interface mpls TenGigabitEthernet3/0/0 enable
set interface mpls TenGigabitEthernet3/0/1 enable
set interface mpls TenGigabitEthernet130/0/0 enable
set interface mpls TenGigabitEthernet130/0/1 enable
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```

**3. L2 CrossConnect interfaces**

Here, I use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can compare the performance
impact of enabling `sFlow`, and I can also assert that no regression occurs in the _baseline_ pair
when I enable a feature in the _experiment_ pair, which should really never happen.

```
comment { L2 xconnected interfaces }
set int state TenGigabitEthernet3/0/2 up
set int state TenGigabitEthernet3/0/3 up
set int state TenGigabitEthernet130/0/2 up
set int state TenGigabitEthernet130/0/3 up
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```

**4. T-Rex Configuration**

The Cisco T-Rex loadtester runs on another machine in the same rack. Physically, it has eight
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
just has VLANs with two ports each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
VLANs are used.

The configuration for T-Rex then becomes:

```
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac: 00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc
    - src_mac: 00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd

    - src_mac: 00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01
    - src_mac: 00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00

    - src_mac: 00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0
    - src_mac: 00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1

    - src_mac: 9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01
    - src_mac: 9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00
```

Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.

For L2XC, the MAC addresses don't matter. VPP will put the NIC in _promiscuous_ mode, which means
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
L2XC mode (the second and fourth pairs), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.

With all config in place, I run a quick bidirectional loadtest using 256b packets at line rate,
which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!

{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}

The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in
_baseline_ mode and at a somewhat unfavorable 1:100 sampling rate.

Ready? Here I go!

### v1: Workers send RPC to main

**TL/DR:** _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
baseline_

The first iteration goes all the way back to a proof of concept from last year. It's described in
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
stellar:

* ☢ When slamming a single sFlow-enabled interface, _all interfaces_ regress. When sending 8Mpps
  of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only
  5.2Mpps get through. This is considered a mortal sin in VPP-land.
* ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad.
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10), completely
  destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.

Here's the bloodbath as seen from T-Rex:

{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
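
For concreteness, here's roughly what the v1 shape looks like: a minimal sketch using VPP's
`vl_api_rpc_call_main_thread()` (the API the plugin used at this stage), with made-up names and a
hypothetical `sflow_psample_send()`; VPP's `u8`/`u32` typedefs are assumed:

```
/* v1 shape (illustrative, not the plugin's exact code): ship each sample to
 * main as an RPC. VPP copies the payload and runs the callback in the main
 * thread, but enqueueing the RPC takes a spinlock all workers contend on. */
#include <vlibmemory/api.h>

typedef struct {
  u32 hw_if_index;   /* input interface of the sampled packet */
  u32 sampling_N;    /* configured sampling rate */
  u32 length;        /* bytes of captured header */
  u8 header[256];    /* first bytes of the sampled packet */
} sflow_sample_t;

static void
sflow_rpc_cb (sflow_sample_t *s)  /* runs in the main thread */
{
  sflow_psample_send (s);         /* hypothetical netlink writer */
}

static void
sflow_sample_from_worker (sflow_sample_t *s)
{
  vl_api_rpc_call_main_thread ((void *) sflow_rpc_cb, (u8 *) s, sizeof (*s));
}
```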

**Debrief**: When we talked through these issues, we drew the conclusion that it would be much
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks are
needed, and each worker thread will have its own producer queue.

Then, we can create a separate thread (or even a pool of threads), scheduled possibly on a different
CPU (or in main), that runs a loop iterating over all sFlow sample queues, consuming the samples and
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
many coming in.

### v2: Workers send PSAMPLE directly

**TL/DR:** _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_

But before we do that, we have a curiosity itch to scratch - what if we sent the sample directly
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
certain, though: it should be much faster than sending an RPC to the main thread.

After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
workers send directly, they invoke `sflow_send_sample_from_worker()` instead of sending an RPC
with `vl_api_rpc_call_main_thread()` as in the previous version.
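
To get a feel for what the worker now has to do inline, here's a minimal sketch of emitting one
`PSAMPLE_CMD_SAMPLE` message over generic netlink (my illustration, not the plugin's code: it
assumes the "psample" generic netlink family id was resolved beforehand, uses the attribute types
from `linux/psample.h`, and omits all error and bounds handling):

```
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <linux/psample.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Append one netlink attribute to the message under construction. */
static void nl_put_attr (struct nlmsghdr *nlh, int type, const void *data, int len)
{
  struct nlattr *nla = (struct nlattr *) ((char *) nlh + NLMSG_ALIGN (nlh->nlmsg_len));
  nla->nla_type = type;
  nla->nla_len = NLA_HDRLEN + len;
  memcpy ((char *) nla + NLA_HDRLEN, data, len);
  nlh->nlmsg_len = NLMSG_ALIGN (nlh->nlmsg_len) + NLA_ALIGN (nla->nla_len);
}

/* Assemble and send one sample; family_id is the resolved "psample" id. */
static int psample_send (int fd, uint16_t family_id, uint32_t group,
                         uint32_t seq, uint16_t iifindex, uint32_t rate,
                         const void *pkt, uint32_t pkt_len)
{
  char buf[2048];
  struct nlmsghdr *nlh = (struct nlmsghdr *) buf;
  struct genlmsghdr *gh = NLMSG_DATA (nlh);

  memset (buf, 0, sizeof (buf));
  nlh->nlmsg_len = NLMSG_LENGTH (GENL_HDRLEN);
  nlh->nlmsg_type = family_id;
  nlh->nlmsg_flags = NLM_F_REQUEST;
  gh->cmd = PSAMPLE_CMD_SAMPLE;
  gh->version = PSAMPLE_GENL_VERSION;

  nl_put_attr (nlh, PSAMPLE_ATTR_IIFINDEX, &iifindex, sizeof (iifindex));
  nl_put_attr (nlh, PSAMPLE_ATTR_SAMPLE_RATE, &rate, sizeof (rate));
  nl_put_attr (nlh, PSAMPLE_ATTR_ORIGSIZE, &pkt_len, sizeof (pkt_len));
  nl_put_attr (nlh, PSAMPLE_ATTR_SAMPLE_GROUP, &group, sizeof (group));
  nl_put_attr (nlh, PSAMPLE_ATTR_GROUP_SEQ, &seq, sizeof (seq));
  nl_put_attr (nlh, PSAMPLE_ATTR_DATA, pkt, pkt_len);

  return send (fd, nlh, nlh->nlmsg_len, 0);
}
```

Even before the syscall itself, all this buffer assembly is work the dataplane arguably shouldn't
be doing per sample, which is what the measurements below bear out.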

The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
counters so we can more accurately count the cumulative time spent in each part of the calls, see
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
err` instead.

When loadtesting this, the deadly sin of impacting performance of interfaces that do not have
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
received and how many it transmitted:

```
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
    vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow                  active       844916    216298496         0  8.69e1        256.00
sflow                  active      1107466    283511296         0  8.26e1        256.00
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
   217929472  sflow  sflow packets processed     error
     1614519  sflow  sflow packets sampled       error
  2606893106  sflow  CPU cycles in sent samples  error
   280697344  sflow  sflow packets processed     error
     2078203  sflow  sflow packets sampled       error
  1844674406  sflow  CPU cycles in sent samples  error
```

At a glance, I can see in the first `grep` the in and out vector (== packet) rates for each worker
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
(as _even_ worker thread numbers are on NUMA domain 0), and worker thread 1 is servicing
TenGig130/0/0. What's cool about this is that it gives me an easy way to compare baseline L3
(10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and
9.45Mpps in experiment.

Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
are 1614519 sampled packets out of 217929472 processed packets (ie. roughly a 1:100 rate)? I added a
CPU clock cycle counter that accumulates the clocks spent once samples are taken. I can see that
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
which is pretty terrible.

**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
removes the regression on non-enabled interfaces, and we learned what the cost is of sending those
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
the queue/ring in another thread that won't block the dataplane.

### v3: SVM FIFO from workers, dedicated PSAMPLE pthread

**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption of the FIFO queue messages_

Neil reports back after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which each worker can write its samples in the dataplane node. A new
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` on all workers' queues and
pump those samples into Netlink.
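
The shape of that data path, as a small sketch (illustrative; the commit's actual `fifo_segment`
setup is more involved, `sflow_sample_t` is the record type from earlier, and `psample_write()`
stands in for the netlink writer):

```
#include <svm/svm_fifo.h>

/* Worker side: copy one fixed-size sample record into this worker's FIFO;
 * if there's no room, drop the sample rather than block the dataplane. */
static void
worker_enqueue_sample (svm_fifo_t *f, sflow_sample_t *s)
{
  if (svm_fifo_max_enqueue (f) >= sizeof (*s))
    svm_fifo_enqueue (f, sizeof (*s), (u8 *) s);
  /* else: sample is dropped */
}

/* Consumer side, in the spt_process_samples pthread: drain each worker's
 * FIFO and write the samples out to PSAMPLE netlink. */
static void
drain_worker_fifo (svm_fifo_t *f)
{
  sflow_sample_t s;
  while (svm_fifo_dequeue (f, sizeof (s), (u8 *) &s) == sizeof (s))
    psample_write (&s);   /* hypothetical netlink writer */
}
```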

The overhead of copying the samples onto a VPP-native `svm_fifo` seems to be two orders of magnitude
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
wrong, as invariably after a short while the Netlink writes start returning _Message too long_
errors.

```
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow                  active      1096132    280609792         0  1.63e1        256.00
sflow                  active      1584577    405651712         0  1.46e1        256.00
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
   280635904  sflow  sflow packets processed     error
     2079194  sflow  sflow packets sampled       error
   733447310  sflow  CPU cycles in sent samples  error
   405689856  sflow  sflow packets processed     error
     3004118  sflow  sflow packets sampled       error
  1844674407  sflow  CPU cycles in sent samples  error
```

Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
~87 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
(733447310 cycles over 2079194 samples) in this version. Also, any risk of Netlink writes stalling
the dataplane has been eliminated, because they're now offloaded to a different thread entirely.

**Debrief**: It's not great that we created a new Linux `pthread` for the consumer of the samples.
VPP has an elaborate thread management system, with collaborative multitasking in its threading
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
on. I can't help but wonder: wouldn't we be able to just move the `spt_process_samples()` thread
into a VPP process node instead?

### v3bis: SVM FIFO, PSAMPLE process in Main

**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_

Neil agrees that there's no good reason to keep this out of main, and conjures up
[[df2dab8d](https://github.com/sflow/vpp-sflow/commit/df2dab8d)], which rewrites the thread as an
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
_process_ wakes up and does some work. It's a widely used pattern in VPP.
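
For those unfamiliar with the pattern: a VPP _process_ node is a cooperatively scheduled routine
that runs in the main thread and yields between wakeups. A minimal sketch of its shape (names are
illustrative, body elided):

```
#include <vlib/vlib.h>

static uword
sflow_process_samples (vlib_main_t *vm, vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{
  uword *event_data = 0;

  while (1)
    {
      /* Yield; wake up after 100ms, or earlier if an event is signalled. */
      vlib_process_wait_for_event_or_clock (vm, 0.1);
      vlib_process_get_events (vm, &event_data);
      vec_reset_length (event_data);

      /* ... drain the per-worker FIFOs, batch samples out to PSAMPLE ... */
    }
  return 0;
}

VLIB_REGISTER_NODE (sflow_process_samples_node) = {
  .function = sflow_process_samples,
  .type = VLIB_NODE_TYPE_PROCESS,
  .name = "sflow-process-samples",
};
```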

Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
rate, which causes lots of `clib_warning()` messages to be spewed on the console. I replace those
with a counter of failed Netlink messages instead, and commit the refactor in
[[6ba4715](https://github.com/sflow/vpp-sflow/commit/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].

```
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28052  4.66e4          0.00
sflow                  active      1134102    290330112         0  1.42e1        256.00
sflow                  active      1647240    421693440         0  1.32e1        256.00
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
       77945  sflow  sflow PSAMPLE sent          error
         863  sflow  sflow PSAMPLE send failed   error
   290376960  sflow  sflow packets processed     error
     2151184  sflow  sflow packets sampled       error
   421761024  sflow  sflow packets processed     error
     3119625  sflow  sflow packets sampled       error
```

With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ uses
4.66e4 CPU cycles per wakeup. Secondly, replacing `clib_warning()` with the `sflow PSAMPLE send
failed` counter reduced the time spent in the dataplane from 16.3 to 14.2 cycles on average. Nice.

**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
these send failures and corrupt packets are really messing things up. So while the provided FIFO
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
and we fear that it may not be safe to read from another thread.

### v4: Custom lockless FIFO, PSAMPLE process in Main

**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_

After reading around a bit in DPDK's
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
commit in
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
`sflow_fifo_enqueue()`, to be called in the workers, and `sflow_fifo_dequeue()`, to be called in the
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
consider black magic, in commit
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
which makes use of the `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
`vppinfra/atomics.h`.
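
To give a feel for the technique, here's my own minimal sketch of the single-producer,
single-consumer case (one worker, one reader in main) under the same acquire/release idea - not
Neil's actual code. The producer only ever writes `tail`, the consumer only ever writes `head`, and
a release store publishes a slot only after its contents are written; VPP's
`clib_atomic_load_acq_n()`/`clib_atomic_store_rel_n()` wrap the `__atomic` builtins used below:

```
#include <stdint.h>

#define FIFO_DEPTH 4 /* power of two; deliberately shallow, see below */

typedef struct { uint8_t data[256]; uint32_t len; } sample_t;

typedef struct {
  uint32_t head; /* only written by the consumer (main) */
  uint32_t tail; /* only written by the producer (worker) */
  sample_t slots[FIFO_DEPTH];
} sample_fifo_t;

/* Worker: returns 0 and drops the sample if the FIFO is full; never blocks. */
static inline int fifo_enqueue (sample_fifo_t *f, const sample_t *s)
{
  uint32_t tail = f->tail;
  uint32_t head = __atomic_load_n (&f->head, __ATOMIC_ACQUIRE);
  if (tail - head >= FIFO_DEPTH)
    return 0;                         /* full */
  f->slots[tail % FIFO_DEPTH] = *s;   /* write the slot first... */
  __atomic_store_n (&f->tail, tail + 1, __ATOMIC_RELEASE); /* ...then publish */
  return 1;
}

/* Main: returns 0 if the FIFO is empty. */
static inline int fifo_dequeue (sample_fifo_t *f, sample_t *s)
{
  uint32_t head = f->head;
  uint32_t tail = __atomic_load_n (&f->tail, __ATOMIC_ACQUIRE);
  if (head == tail)
    return 0;                         /* empty */
  *s = f->slots[head % FIFO_DEPTH];
  __atomic_store_n (&f->head, head + 1, __ATOMIC_RELEASE);
  return 1;
}
```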

What I really like about this change is that it introduces a FIFO implementation in about twenty
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
and will be even faster than it was before. I take it out for a loadtest:

```
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
  vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     17767  1.52e6          0.00
sflow                  active      1121156    287015936         0  1.56e1        256.00
sflow                  active      1605772    411077632         0  1.53e1        256.00
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
     3553600  sflow  sflow PSAMPLE sent          error
   287101184  sflow  sflow packets processed     error
     2127024  sflow  sflow packets sampled       error
      350224  sflow  sflow packets dropped       error
   411199744  sflow  sflow packets processed     error
     3043693  sflow  sflow packets sampled       error
     1266893  sflow  sflow packets dropped       error
```

This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone; there is a slight regression (because we're now actually _sending_ the
messages). With the v3bis variant, only a tiny fraction of the samples made it through to Netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken on
the first interface, 350224 are prematurely dropped (because the FIFO queue is full). This is a
perfect defense in depth!

Doing the math, each worker enqueued exactly 1776800 samples in 30 seconds (2127024 - 350224 for the
first, 3043693 - 1266893 for the second), which is about 59k/s per interface. I can also see that
the second interface, which is doing L2XC and hits a much higher packets/sec throughput, drops more
samples, because main spends an equal amount of time reading samples from each queue. In other
words: in an overload scenario, one interface cannot crowd out another. Slick.

Finally, completing my math, each worker enqueued 1776800 samples to its FIFO, and I see that main
dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the
`sflow PSAMPLE send failed` counter remains zero.

{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}

**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
these netlink messages. There's also a bit of work to do to retrieve the packet and byte counters
of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some messages
from VPP. He calls this `mod_vpp`, and he mails a screenshot of his work in progress. I'll discuss
the end-to-end changes to `hsflowd` in a followup article, and focus my efforts here on documenting
the VPP parts only. But, as a teaser, here's a screenshot of validated `sflowtool` output from a
VPP instance using our `sFlow` plugin and his pending `host-sflow` changes to integrate the rest of
the business logic outside of the VPP dataplane, where it's arguably expensive to make mistakes.

Neil reveals an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Most of the
time, the plugin is just shoveling ethernet packets through to `ethernet-input`. To make better use
of the CPU's instruction cache and prefetching, the loop that does this shoveling can handle one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually reduce the CPU cycles spent per packet if you shovel
four of them at a time.

### v5: Quad Bucket Brigade in worker

**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_

Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
2-packet to 4-packet shoveling. In commit
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
extends a common pattern in VPP dataplane nodes: each time the node iterates, it prefetches up to
eight packets ahead (`p0`-`p7`) if the vector is long enough, and handles packets four at a time
(`b0`-`b3`). He also adds a few compiler hints for branch prediction: almost no packets will have
tracing enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize
the code.
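
The shape of such a loop, as a simplified sketch (the plugin's `node.c` also handles tracing and
next-node indices, and `process_one()` is a hypothetical per-packet handler I made up):

```
#include <vlib/vlib.h>

/* Simplified "quad bucket brigade" in the VPP style: while handling buffers
 * b[0]..b[3], prefetch the headers of the *next* four so they're warm in
 * cache by the time the loop comes around. */
static_always_inline void
quad_bucket_brigade (vlib_main_t *vm, vlib_node_runtime_t *node,
                     vlib_buffer_t **b, u32 n_left_from)
{
  while (n_left_from >= 4)
    {
      if (PREDICT_TRUE (n_left_from >= 8))
        {
          vlib_prefetch_buffer_header (b[4], LOAD);
          vlib_prefetch_buffer_header (b[5], LOAD);
          vlib_prefetch_buffer_header (b[6], LOAD);
          vlib_prefetch_buffer_header (b[7], LOAD);
        }
      process_one (vm, node, b[0]);
      process_one (vm, node, b[1]);
      process_one (vm, node, b[2]);
      process_one (vm, node, b[3]);
      b += 4;
      n_left_from -= 4;
    }
  while (n_left_from > 0)   /* handle the remainder one at a time */
    {
      process_one (vm, node, b[0]);
      b += 1;
      n_left_from -= 1;
    }
}
```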

Reading the dataplane code, I find it incredibly ugly. But that's the price to pay for ultra fast
throughput. How do we see the effect, though? My low-tech proposal is to enable sampling at a very
sparse rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the
FIFO is almost never called. Then, what's left for the `sFlow` dataplane node really is to shovel
packets from `device-input` into `ethernet-input`.

To measure the relative improvement, I do one test with, and one without, commit
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].

```
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
  vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28467  9.36e3          0.00
sflow                  active      1158325    296531200         0  1.09e1        256.00
sflow                  active      1679742    430013952         0  1.11e1        256.00

pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
  vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
Name                    State        Calls      Vectors  Suspends  Clocks  Vectors/Call
sflow-process-samples  any wait          0            0     28462  9.57e3          0.00
sflow                  active      1137571    291218176         0  1.26e1        256.00
sflow                  active      1641991    420349696         0  1.20e1        256.00
```

Would you look at that, this optimization actually works as advertised! There is a meaningful
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
L2XC). So at the expense of adding 63 lines of code, there is a 1.8% (L3) to 2.3% (L2XC) increase
in throughput. **Quad-Bucket-Brigade, yaay!**

I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
recall that at the beginning of this article I ran an acceptance loadtest with sFlow disabled; this
is the exact same result **with sFlow** enabled:

{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}

This picture says it all: 79.98Gbps in, 79.98Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
samples/sec taken from the dataplane, with correct rate limiting due to the per-worker FIFO depth
limit, yielding 25k samples/sec sent to Netlink.

## What's Next

Checking in on the three main things we wanted to ensure with the plugin:

1. ✅ If `sFlow` is not enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average.
1. ✅ If `sFlow` samples a packet, it takes only marginally more CPU time to enqueue it. With no
   sampling we get 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, while 1:1000 sampling reduces
   this to 9.77Mpps of L3 and 14.05Mpps of L2XC, and an overly harsh 1:100 reduces it to 9.69Mpps
   and 13.97Mpps.

The hard part is done, but we're not entirely done yet. What's left is to implement a set of packet
and byte counters, send this information along with possible Linux CP data (such as the TAP
interface ID on the Linux side), and add the module for VPP in `hsflowd`. I'll write about that
part in a followup article.

Neil has introduced the plugin on vpp-dev@, and so far there were no objections. But he has pointed
folks at an out-of-tree GitHub repo, and I may add a Gerrit instead so that it becomes part of the
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, and add some
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
shape.

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from.