From 08d55e6ac05348dbf5da826c85891c8cedfe9a5f Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Mon, 9 Sep 2024 10:26:15 +0200
Subject: [PATCH] sFlow, part 1

---
 content/articles/2024-09-08-sflow-1.md | 701 +++++++++++++++++++++++++
 static/assets/sflow/sflow.gif          |   3 +
 static/assets/sflow/trex-baseline.png  |   3 +
 static/assets/sflow/trex-overload.png  |   3 +
 static/assets/sflow/trex-passthru.png  |   3 +
 5 files changed, 713 insertions(+)
 create mode 100644 content/articles/2024-09-08-sflow-1.md
 create mode 100644 static/assets/sflow/sflow.gif
 create mode 100644 static/assets/sflow/trex-baseline.png
 create mode 100644 static/assets/sflow/trex-overload.png
 create mode 100644 static/assets/sflow/trex-passthru.png

diff --git a/content/articles/2024-09-08-sflow-1.md b/content/articles/2024-09-08-sflow-1.md
new file mode 100644
index 0000000..94d8c81
--- /dev/null
+++ b/content/articles/2024-09-08-sflow-1.md
@@ -0,0 +1,701 @@
+---
+date: "2024-09-08T12:51:23Z"
+title: 'VPP with sFlow - Part 1'
+---
+
+# Introduction
+
+{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
+
+In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
+Ciprian reached out to me after seeing my [[DENOG
+#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
+IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
+[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)],
+which is able to emit IPFIX records. Unfortunately, I never really got it to work well in my tests,
+as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
+dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
+a few fixes to flowprobe, but it remains a computationally expensive operation. Wouldn't copying
+one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
+
+In the months that followed, I discussed the feature with the incredible folks at
+[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
+toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)],
+but for lack of time I didn't manage to get it to work.
+
+However, I have a bit of time on my hands in September and October, and just a few weeks ago,
+my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
+sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
+
+## sFlow: Protocol
+
+Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
+authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
+
+sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
+N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
+On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
+bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
+the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
+forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
+communities, and what-not).
+Finally, the Agent will periodically read the octet
+and packet counters of the physical network interface. Ultimately, the Agent will send the samples
+and additional information to an **sFlow Collector** for further processing.
+
+sFlow has been specifically designed to take advantage of the statistical properties of packet
+sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
+monitoring system will always produce statistically quantifiable measurements. You can read more
+about it in Peter Phaal and Sonia Panchen's
+[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did :)
+
+### sFlow: Netlink PSAMPLE
+
+sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
+done in hardware, but there also exist several software implementations. One very clever thing, I
+think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
+API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
+allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
+samples of a certain _group_. The PSAMPLE API uses
+[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
+thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
+[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
+
+The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
+from the network interfaces, and generating Netlink messages. Then, some other program, notably
+outside of VPP, can subscribe to these messages and further handle them, creating UDP packets with
+sFlow samples and counters and other information, and sending them to an **sFlow Collector**
+somewhere else on the network.
+
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
+
+There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
+subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
+this stuff, I kept on getting errors. It turns out there's a kernel module that needs to be loaded:
+`modprobe psample`, and make sure it's in `/etc/modules`, before you spend as many hours as I did
+pulling out hair.
+
+## VPP: sFlow Plugin
+
+For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
+[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
+performance.
+
+### sFlow Plugin: Anatomy
+
+The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
+plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
+`device-input`. In other words, if enabled, the plugin will get a copy of all packets that are read
+from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
+packet, and if it's not selected for sampling, just move it onwards to the next node, typically
+`ethernet-input`.
+
+The kicker is that one in N packets will be selected to sample, after which:
+1. the ethernet header (`*en`) is extracted from the packet
+1. the input interface (`hw_if_index`) is extracted from the VPP buffer
+1. if there are too many samples from this worker thread still being worked on, the sample is
+   discarded and an error counter is incremented.
+   This protects the main thread from being slammed with samples if there are simply too many
+   being fished out of the dataplane.
+1. Otherwise:
+   * a new `sflow_sample_t` is created, with all the sampling process metadata filled in
+   * the first 128 bytes of the packet are copied into the sample
+   * an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
+
+Both a debug CLI command and API call are added:
+
+```
+sflow enable-disable <interface-name> [<sampling_N>]|[disable]
+```
+
+Some observations:
+
+First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
+make sense to make this per-interface, as routers typically have a mixture of 1G/10G and faster
+100G network cards available. This is not going to be an issue to retrofit.
+
+{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
+
+Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
+requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
+samples are sent from many threads, there will be lock contention and performance will nose dive.
+
+### sFlow Plugin: Functional
+
+I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})], install a bunch of sFlow tools on it, and
+make sure the `psample` kernel module is loaded. In this first test I'll take a look at the table
+stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
+of the four VPP routers. For reference, the Lab looks like this:
+
+{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
+
+What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
+a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
+
+```
+pim@vpp0-3:~ $ iperf3 -s -D
+pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
+```
+
+### Configuring VPP for sFlow
+
+While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do
+is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
+the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
+sampling rate of 1:1000 as these interfaces are GigabitEthernet:
+
+```
+root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
+root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
+root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
+ Name          State       Calls   Vectors   Suspends   Clocks   Vectors/Call
+sflow          active       5656     24168          0   9.01e2           4.27
+```
+
+Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
+business. But is it sending data? To answer this question, I can first take a look at the
+`psampletest` tool:
+
+```
+root@vpp0-2:~# psampletest
+pstest: modprobe psample returned 0
+pstest: netlink socket number = 1637
+pstest: getFamily
+pstest: generic netlink CMD = 1
+pstest: generic family name: psample
+pstest: generic family id: 32
+pstest: psample attr type: 4 (nested=0) len: 8
+pstest: psample attr type: 5 (nested=0) len: 8
+pstest: psample attr type: 6 (nested=0) len: 24
+pstest: psample multicast group id: 9
+pstest: psample multicast group: config
+pstest: psample multicast group id: 10
+pstest: psample multicast group: packets
+pstest: psample found group packets=10
+pstest: joinGroup 10
+pstest: received Netlink ACK
+pstest: joinGroup 10
+pstest: set headers...
+pstest: serialize...
+pstest: print before sending...
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
+pstest: send...
+pstest: send_psample getuid=0 geteuid=0
+pstest: sendmsg returned 140
+pstest: free...
+pstest: start read loop...
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+```
+
+I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push
+100Mbit using 9000 byte jumboframes (which works out to something like 1400 packets/second), I can
+expect two or three samples per second. I immediately notice a few things:
+
+***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
+process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
+Thus, `psampletest` and other tools must also run in that namespace. I mention this because in
+Linux CP, oftentimes the controlplane interfaces are created in a dedicated `dataplane` network
+namespace.
+
+***2. pktlen and hdrlen***: The pktlen is off, and this is a bug. In VPP, packets are put into buffers
+of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the
+same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2` packet
+with length 66, that looks like a legitimate ACK packet on the way back. But why is the hdrlen set
+to 70 there? I'm going to want to ask Neil about that.
+
+***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`,
+which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
+number 1 corresponds with `GigabitEthernet10/0/0` and 2 with `GigabitEthernet10/0/1`, which checks
+out:
+```
+root@vpp0-2:~# vppctl show int
+              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
+GigabitEthernet10/0/0             1      up          9000/0/0/0     rx packets             469552764
+                                                                    rx bytes           4218754400233
+                                                                    tx packets             133717230
+                                                                    tx bytes              8887341013
+                                                                    drops                       6050
+                                                                    ip4                    469321635
+                                                                    ip6                       225164
+GigabitEthernet10/0/1             2      up          9000/0/0/0     rx packets             133527636
+                                                                    rx bytes              8816920909
+                                                                    tx packets             469353481
+                                                                    tx bytes           4218736200819
+                                                                    drops                       6060
+                                                                    ip4                    133489925
+                                                                    ip6                        29139
+
+```
+
+***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
+dataplane. Notably, VPP's numbering of interface indexes is strictly _orthogonal_ to Linux, and it's
+not even guaranteed that there _exists_ an interface in Linux for the PHY upon which the
+sampling is happening. Said differently, `in=1` here is meant to reference VPP's
+`GigabitEthernet10/0/0` interface, but in Linux, `ifIndex=1` is a completely different interface
+(`lo`) in the default network namespace.
+Similarly, `in=2` for VPP's `Gi10/0/1` interface corresponds
+to interface `enp1s0` in Linux:
+
+```
+root@vpp0-2:~# ip link
+1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
+    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
+2: enp1s0: mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+    link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
+```
+
+***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
+normally use `/proc/net/` entries for that, but there are two problems with this:
+1. There may not exist a Linux representation of the interface, for example if it's only doing L2
+   bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
+   interface.
+1. Even if it does exist (and it's the "correct" ifIndex in Linux), the statistics counters in the
+   Linux representation will only count packets and octets of _punted_ packets, that is to say,
+   the stuff that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device.
+   It's important to note that east-west traffic that goes _through_ the dataplane is never punted
+   to Linux, and as such, the counters will be undershooting: only counting traffic _to_ the
+   router, not _through_ the router.
+
+### VPP sFlow: Performance
+
+Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
+of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
+thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
+I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
+Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual Intel 58299), while the second (identical)
+machine will run VPP, also with 8x 10Gbps ports (2x Intel i710-DA4).
+
+I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
+and secondly I'll test that with and without sFlow turned on.
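+
+Since I'll be referring to these port pairs by number later on, here's a summary of the
+configuration that follows (pair numbering follows the T-Rex port order):
+
+| Port pair | T-Rex ports | VPP interfaces        | Mode          | sFlow               |
+|-----------|-------------|-----------------------|---------------|---------------------|
+| 0         | 0, 1        | Te3/0/0 - Te3/0/1     | L3 (IPv4)     | off                 |
+| 1         | 2, 3        | Te3/0/2 - Te3/0/3     | L2 (xconnect) | off                 |
+| 2         | 4, 5        | Te130/0/0 - Te130/0/1 | L3 (IPv4)     | on (in later tests) |
+| 3         | 6, 7        | Te130/0/2 - Te130/0/3 | L2 (xconnect) | on (in later tests) |
+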
+This gives me 8 ports to configure,
+and I'll start with the L2 configuration, as follows:
+
+```
+vpp# set int state TenGigabitEthernet3/0/2 up
+vpp# set int state TenGigabitEthernet3/0/3 up
+vpp# set int state TenGigabitEthernet130/0/2 up
+vpp# set int state TenGigabitEthernet130/0/3 up
+vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
+vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
+vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
+vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
+```
+
+Then, the L3 configuration looks like this:
+
+```
+vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
+vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
+vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
+vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
+vpp# set int state TenGigabitEthernet3/0/0 up
+vpp# set int state TenGigabitEthernet3/0/1 up
+vpp# set int state TenGigabitEthernet130/0/0 up
+vpp# set int state TenGigabitEthernet130/0/1 up
+vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
+vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
+vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
+vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
+vpp# ip route add 16.0.0.0/24 via 100.64.0.0
+vpp# ip route add 48.0.0.0/24 via 100.64.1.0
+vpp# ip route add 16.0.2.0/24 via 100.64.4.0
+vpp# ip route add 48.0.2.0/24 via 100.64.5.0
+vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
+vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
+vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
+vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
+```
+
+And finally, the Cisco T-Rex configuration looks like this:
+
+```
+- version: 2
+  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
+  port_info:
+    - src_mac:  00:1b:21:06:00:00
+      dest_mac: 9c:69:b4:61:a1:dc
+    - src_mac:  00:1b:21:06:00:01
+      dest_mac: 9c:69:b4:61:a1:dd
+
+    - src_mac:  00:1b:21:83:00:00
+      dest_mac: 00:1b:21:83:00:01
+    - src_mac:  00:1b:21:83:00:01
+      dest_mac: 00:1b:21:83:00:00
+
+    - src_mac:  00:1b:21:87:00:00
+      dest_mac: 9c:69:b4:61:75:d0
+    - src_mac:  00:1b:21:87:00:01
+      dest_mac: 9c:69:b4:61:75:d1
+
+    - src_mac:  9c:69:b4:85:00:00
+      dest_mac: 9c:69:b4:85:00:01
+    - src_mac:  9c:69:b4:85:00:01
+      dest_mac: 9c:69:b4:85:00:00
+```
+
+A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
+because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
+received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
+tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
+MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
+9c:69:b4:61:a1:dc`, which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
+wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
+
+With that said, I can start a baseline loadtest like so:
+{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
+
+T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
+are L2 xconnecting), using a packet size of 1514 bytes.
+This amounts to roughly 813Kpps per port, or a
+cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy
+to do the work.
+
+I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
+
+```
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
+```
+
+This should yield about 3'250 or so samples per second, and things look pretty great:
+
+```
+pim@hvn6-lab:~$ vppctl show err
+   Count               Node                         Reason                Severity
+ 5034508               sflow                sflow packets processed        error
+    4908               sflow                 sflow packets sampled         error
+ 5034508               sflow                sflow packets processed        error
+    5111               sflow                 sflow packets sampled         error
+ 5034516             l2-output                 L2 output packets           error
+ 5034516             l2-input                   L2 input packets           error
+ 5034404               sflow                sflow packets processed        error
+    4948               sflow                 sflow packets sampled         error
+ 5034404             l2-output                 L2 output packets           error
+ 5034404             l2-input                   L2 input packets           error
+ 5034404               sflow                sflow packets processed        error
+    4928               sflow                 sflow packets sampled         error
+ 5034404             l2-output                 L2 output packets           error
+ 5034404             l2-input                   L2 input packets           error
+ 5034516             l2-output                 L2 output packets           error
+ 5034516             l2-input                   L2 input packets           error
+```
+
+I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed`, which
+checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
+
+```
+pim@hvn6-lab:~$ sudo psampletest
+...
+pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+```
+
+And yes, all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes correctly
+correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this machine,
+each TenGig network interface has its own dedicated VPP worker thread. Considering I turned on
+sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
+
+```
+pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
+Name            State        Calls      Vectors   Suspends   Clocks   Vectors/Call
+sflow           active     3908218     14350684          0   9.05e1           3.67
+sflow           active     3913266     14350680          0   1.11e2           3.67
+sflow           active     3910828     14350687          0   1.08e2           3.67
+sflow           active     3909274     14350692          0   5.66e1           3.67
+```
+
+Alright, so for the 999 packets that went through and the one packet that got sampled, on average
+VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
+T-Rex.
+
+### VPP sFlow: Cost of passthru
+
+I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
+`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
+up a bizarrely high sampling rate, say one in ten million.
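+
+As a quick sanity check on the sample rates involved here, the back-of-the-envelope arithmetic for
+the 1:1000 test above and for the 1:10'000'000 rate I'm about to configure (plain shell arithmetic,
+using the per-port packet rates quoted in this article):
+
+```
+$ echo $(( 4 * 813000 / 1000 ))       # four ports at ~813Kpps, sampled 1:1000
+3252
+$ echo $(( 4 * 8370000 / 10000000 ))  # four ports at ~8.37Mpps, sampled 1:10'000'000
+3
+```
+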
+I'll also make the T-Rex loadtester use
+only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
+sending smaller packets, say 128 bytes:
+
+```
+tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
+
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
+
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
+```
+
+The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
+`sflow` plugin is not sampling many packets:
+
+```
+pim@hvn6-lab:~$ vppctl show err
+   Count               Node                         Reason                Severity
+  59777084              sflow                sflow packets processed        error
+         6              sflow                 sflow packets sampled         error
+  59777152            l2-output                 L2 output packets           error
+  59777152            l2-input                   L2 input packets           error
+  59777104              sflow                sflow packets processed        error
+         6              sflow                 sflow packets sampled         error
+  59777104            l2-output                 L2 output packets           error
+  59777104            l2-input                   L2 input packets           error
+
+pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
+Name            State        Calls       Vectors   Suspends   Clocks   Vectors/Call
+sflow           active     8186642     369674664          0   1.35e1          45.16
+sflow           active    25173660     369674696          0   1.97e1          14.68
+```
+Two observations:
+
+1. One of these is busier than the other. Without looking further, I can already predict that the
+top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
+dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
+time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
+because of that, it'll end up consuming more packets on each subsequent call, in order to catch up.
+The L2 path, on the other hand, is quicker and will therefore have fewer packets waiting on
+subsequent calls to `dpdk-input`.
+2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
+`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
+more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
+cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache
+over 15 or so packets each time it runs.
+
+I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
+works just fine:
+
+{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
+
+### VPP sFlow: Cost of sampling
+
+The other interesting case is to figure out how much CPU it takes to execute the code path
+with the actual sampling. This one turns out to be a bit trickier to measure.
+While leaving the previous
+loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
+1:10 packets:
+
+```
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
+```
+
+The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
+33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
+
+```
+pim@hvn6-lab:~$ vppctl show err | grep sflow
+ 340502528              sflow               sflow packets processed         error
+  12254462              sflow                sflow packets dropped          error
+  22611461              sflow                sflow packets sampled          error
+ 422527140              sflow               sflow packets processed         error
+   8533855              sflow                sflow packets dropped          error
+  34235952              sflow                sflow packets sampled          error
+```
+
+Ha, this new safeguard popped up: remember, all the way at the beginning I explained how there's a
+safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
+the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
+`sflow packets dropped`, and it accounts for roughly *half* of the samples on one interface, and a
+quarter on the other.
+
+My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under
+the current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning only 665Kpps
+per port, which is horrible, and it's still dropping samples.
+
+My second attempt is to turn off all ports but the last pair (the L2XC port with sFlow), which
+returns 930Kpps from the offered 1.5Mpps. VPP is clearly not having a good time here.
+
+Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
+ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
+port, without sFlow), and VPP forwards all 16Mpps and is happy again.
+
+Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
+regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
+delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
+
+#### Reasoning about the bottleneck
+
+But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer,
+I turn off all ports again, and ramp up the one port pair with L3 + sFlow at a 1:10 ratio to full
+line rate: that is, 64 byte packets at 14.88Mpps:
+
+```
+tui>stop
+tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
+```
+
+VPP is now on the struggle bus and is returning 3.17Mpps, or 21% of that. But I think it'll give me
+some reasonable data to try to feel out where the bottleneck is.
+
+```
+Thread 2 vpp_wk_1 (lcore 3)
+Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
+  vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
+             Name                 State       Calls      Vectors   Suspends   Clocks   Vectors/Call
+TenGigabitEthernet130/0/1-outp   active       77906     19943936          0   5.79e0         256.00
+TenGigabitEthernet130/0/1-tx     active       77906     19943936          0   6.88e1         256.00
+dpdk-input                       polling      77906     19943936          0   4.41e1         256.00
+ethernet-input                   active       77906     19943936          0   2.21e1         256.00
+ip4-input                        active       77906     19943936          0   2.05e1         256.00
+ip4-load-balance                 active       77906     19943936          0   1.07e1         256.00
+ip4-lookup                       active       77906     19943936          0   1.98e1         256.00
+ip4-rewrite                      active       77906     19943936          0   1.97e1         256.00
+sflow                            active       77906     19943936          0   6.14e1         256.00
+
+pim@hvn6-lab:pim# vppctl show err | grep sflow
+ 551357440              sflow               sflow packets processed         error
+  19829380              sflow                sflow packets dropped          error
+  36613544              sflow                sflow packets sampled          error
+```
+
+OK, the `sflow` plugin saw 551M packets and selected 36.6M of them for sampling, but it ultimately
+sent RPCs to the main thread for only 16.8M of them, dropping the other 19.8M. There are three code
+paths:
+
+1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
+through a packet.
+1. Very cheap: select the sample and construct the RPC, but toss it, costing Y CPU cycles.
+1. Expensive: select the sample, and send the RPC. Z CPU cycles in the worker, and another amount
+in main.
+
+Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
+into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
+assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
+guesstimate that it takes Y=20 CPU cycles.
+
+With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
+
+```
+AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
+
+61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
+61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
+33853346816 = 7443325440 + 732270880 + 16784164 * Z
+25677750496 = 16784164 * Z
+Z = 1529
+```
+
+Good to know! I find spending 1500 cycles to send the sample pretty reasonable, but for a dataplane
+that is trying to do 10Mpps per core, with a core running at 2.2GHz, there are really only 220 CPU
+cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
+feels dangerous to me.
+
+Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
+273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
+which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
+observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
+packets back out (same number: 3.1607e6).
+
+So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
+
+```
+pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
+    sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
+    rx missed                                    4065539464
+    rx missed                                    4182788310
+```
+
+In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. But I already know that it
+performed 3.16Mpps, and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
+for! It's just that DPDK never managed to read them from the hardware.
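+
+If you want to repeat this quick-and-dirty measurement, here it is wrapped in a small shell sketch.
+It assumes `vppctl` is in the PATH and that the counter shows up as `rx missed` in
+`show hardware-interfaces`, as it does above; the interface name is simply the one from this lab:
+
+```
+#!/usr/bin/env bash
+# Sketch: report the rx-miss rate (packets/sec) of one VPP interface over a 10 second window.
+IF="${1:-TenGigabitEthernet130/0/0}"
+get_missed() { vppctl show hardware-interfaces "$IF" | awk '/rx missed/ {print $3}'; }
+
+before=$(get_missed)
+sleep 10
+after=$(get_missed)
+echo "$IF: $(( (after - before) / 10 )) missed packets/sec"
+```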
+
+As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
+delivered:
+
+```
+Thread 2 vpp_wk_1 (lcore 3)
+Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
+  vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
+             Name                 State       Calls       Vectors   Suspends   Clocks   Vectors/Call
+TenGigabitEthernet130/0/1-outp   active      620012     158723072          0   5.66e0         256.00
+TenGigabitEthernet130/0/1-tx     active      620012     158723072          0   7.01e1         256.00
+dpdk-input                       polling     620012     158723072          0   4.39e1         256.00
+ethernet-input                   active      620012     158723072          0   1.56e1         256.00
+ip4-input-no-checksum            active      620012     158723072          0   1.43e1         256.00
+ip4-load-balance                 active      620012     158723072          0   1.11e1         256.00
+ip4-lookup                       active      620012     158723072          0   2.00e1         256.00
+ip4-rewrite                      active      620012     158723072          0   2.02e1         256.00
+```
+
+Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
+Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
+_enough_!"
+
+For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps), and saw that
+about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
+loadtest went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow threads
+are also impacted. I spent some time thinking about this and poking around, but I did not find a
+good explanation why port pairs 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here's a
+screenshot of VPP on the struggle bus:
+
+{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
+
+**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
+for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
+sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
+situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
+accounts for only 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
+cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
+though.
+
+## What's Next
+
+I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
+other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
+the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
+and packet lengths are all correct.
+
+I should do a few more tests with a debug binary and profiling turned on. I quickly ran `perf` over
+the VPP (release / optimized) binary running on the bench, but it merely showed that with sFlow
+enabled, roughly 80% of the time is spent in `libvlib`, against about 54% in the baseline (sFlow
+turned off):
+
+```
+root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
+root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
+# Overhead  Shared Object (sFlow)     Overhead  Shared Object (baseline)
+# ........  ......................    ........  ........................
+#
+  79.02%  libvlib.so.24.10            54.27%  libvlib.so.24.10
+  12.82%  libvnet.so.24.10            33.91%  libvnet.so.24.10
+   3.77%  dpdk_plugin.so              10.87%  dpdk_plugin.so
+   3.21%  [kernel.kallsyms]            0.81%  [kernel.kallsyms]
+   0.29%  sflow_plugin.so              0.09%  ld-linux-x86-64.so.2
+   0.28%  libvppinfra.so.24.10         0.03%  libc.so.6
+   0.21%  libc.so.6                    0.01%  libvppinfra.so.24.10
+   0.17%  libvlibapi.so.24.10          0.00%  libvlibmemory.so.24.10
+   0.15%  libvlibmemory.so.24.10
+   0.07%  ld-linux-x86-64.so.2
+   0.00%  vpp
+   0.00%  [vdso]
+   0.00%  libsvm.so.24.10
+```
+
+Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
+to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
+interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
+reading interface counters from the dataplane (or from the Prometheus Exporter), and most
+importantly, ensuring it works well, or fails gracefully, under stringent load.
+
+## Acknowledgements
+
+The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
+thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
+and Peter Phaal for facilitating a get-together last year.
+
+Who's up for making this thing a reality?!
diff --git a/static/assets/sflow/sflow.gif b/static/assets/sflow/sflow.gif
new file mode 100644
index 0000000..59a0723
--- /dev/null
+++ b/static/assets/sflow/sflow.gif
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1f0c58462a399d3c3a5e79ac70997207e173343f06b6976b093427527f3989ee
+size 2128
diff --git a/static/assets/sflow/trex-baseline.png b/static/assets/sflow/trex-baseline.png
new file mode 100644
index 0000000..462cb88
--- /dev/null
+++ b/static/assets/sflow/trex-baseline.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0c1e0c5576c211e08392a03292cd9f37b5c6759f9a4e3e1e195f3678c8a90265
+size 1113146
diff --git a/static/assets/sflow/trex-overload.png b/static/assets/sflow/trex-overload.png
new file mode 100644
index 0000000..48b75ce
--- /dev/null
+++ b/static/assets/sflow/trex-overload.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:efd01ff996582e01759d7a61fc99169949198d6925992dd5a74d6ca919910432
+size 1124291
diff --git a/static/assets/sflow/trex-passthru.png b/static/assets/sflow/trex-passthru.png
new file mode 100644
index 0000000..fb76040
--- /dev/null
+++ b/static/assets/sflow/trex-passthru.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8a791772479d07324e17d08b1e96c6f9855045930415219606f87446edb491b8
+size 1094333