| --- | ||||
| date: "2025-02-08T07:51:23Z" | ||||
| title: 'VPP with sFlow - Part 3' | ||||
| --- | ||||
|  | ||||
| # Introduction | ||||
|  | ||||
| {{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}} | ||||
|  | ||||
In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the use of
networks, enabling performance optimization, accounting/billing for usage, and defense against
security threats.
|  | ||||
| The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it | ||||
| forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and | ||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows a small
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in
the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
| feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. | ||||
| This greatly reduces the complexity of code to be implemented in the forwarding path, while at the | ||||
| same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business | ||||
| logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a | ||||
| central _Collector_. | ||||
|  | ||||
| In this third article, I wanted to spend some time discussing how samples make their way out of the | ||||
| VPP dataplane, and into higher level tools. | ||||
|  | ||||
| ## Recap: sFlow | ||||
|  | ||||
| {{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}} | ||||
|  | ||||
| sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in | ||||
| [[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5 | ||||
| and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a | ||||
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to the local sFlow Agent.
|  | ||||
| **Samples**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As | ||||
| the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After the | ||||
| egress port(s) are found in an L2 FIB, or a next hop (and port) is found after a routing decision, | ||||
| the ASIC can annotate the sample with this `outIfIndex` and `DstMAC` metadata as well. | ||||
|  | ||||
| **Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was | ||||
| not_ routed or switched, but rather discarded?  For this, sFlow is able to describe the reason for | ||||
| the drop. For example, the ASIC receive queue could have been overfull, or it did not find a | ||||
| destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the | ||||
| packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the | ||||
| transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard | ||||
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
|  | ||||
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
destination of the packet, and what additional RIB information is available (AS path, BGP
communities, and so on). This may be added to the sample record as well.
|  | ||||
**Counters**: Since we're doing sampling of 1:N packets, we can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if we
| sample a fraction of the traffic but know how many bytes and packets we saw in total, we can provide | ||||
| an overview with a quantifiable accuracy. This is why the Agent will periodically get the interface | ||||
| counters from the ASIC. | ||||
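
To make that concrete, here's a small back-of-the-envelope sketch in Python. The `196 * sqrt(1/c)`
rule of thumb is the 95%-confidence error bound quoted in the sFlow packet sampling material linked
above; the `estimate()` helper itself is mine, purely for illustration.

```python
import math

def estimate(samples_seen: int, sampling_n: int) -> tuple[float, float]:
    """Return (estimated total packets, relative error in % at 95% confidence)."""
    total = samples_seen * sampling_n
    pct_error = 196.0 * math.sqrt(1.0 / samples_seen) if samples_seen else float("inf")
    return total, pct_error

print(estimate(10_000, 1000))   # ~10M packets, accurate to within roughly 2%
```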
|  | ||||
| **Collector**: One or more samples can be concatenated into UDP messages that go from the Agent to a | ||||
| central _sFlow Collector_. The heavy lifting in analysis is done upstream from the switch or router, | ||||
| which is great for performance.  Many thousands or even tens of thousands of agents can forward | ||||
| their samples and interface counters to a single central collector, which in turn can be used to | ||||
| draw up a near real time picture of the state of traffic through even the largest of ISP networks or | ||||
| datacenter switch fabrics. | ||||
|  | ||||
In sFlow parlance, [[VPP](https://fd.io/)] together with its companion `hsflowd` forms an _Agent_
(it sends the UDP packets over the network), and, for example, the commandline tool `sflowtool`
could be a _Collector_ (it receives the UDP packets).
|  | ||||
| ## Recap: sFlow in VPP | ||||
|  | ||||
| First, I have some pretty good news to report - our work on this plugin was | ||||
| [[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a | ||||
| few weeks! Last weekend, I gave a lightning talk at | ||||
| [[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)] | ||||
| and caught up with a lot of community members and network- and software engineers. I had a great | ||||
| time. | ||||
|  | ||||
As a reminder: by keeping the amount of code in the dataplane low, we get both high performance and
a smaller probability of bugs causing harm. And I do like simple implementations, as they tend to
cause fewer _SIGSEGVs_. The architecture of the end-to-end solution consists of three distinct
parts, each with its own risk and performance profile:
|  | ||||
| {{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}} | ||||
|  | ||||
| **sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves | ||||
| packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin | ||||
| will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply | ||||
| copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a | ||||
| [[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples | ||||
arrive, samples are dropped at the tail, and a counter is incremented. This way, I can tell when the
dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker thread to
get its fair share of samples into the Agent's hands.
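
To illustrate the tail-drop behaviour, here's a minimal Python sketch of such a bounded per-worker
FIFO. It's a conceptual model only: the plugin implements this in C inside VPP, and the class name
and queue depth here are my own choices.

```python
from collections import deque

class WorkerSampleFifo:
    """Bounded per-worker sample queue: drop at the tail when full, count the drops."""
    def __init__(self, depth=1024):
        self.q = deque()
        self.depth = depth
        self.dropped = 0

    def push(self, sample) -> bool:
        if len(self.q) >= self.depth:
            self.dropped += 1        # dataplane out-ran the consumer; newest sample is discarded
            return False
        self.q.append(sample)
        return True

    def pop_all(self):
        drained, self.q = list(self.q), deque()   # consumer drains the whole queue in one go
        return drained

fifo = WorkerSampleFifo(depth=2)
for pkt in ("pkt1", "pkt2", "pkt3"):
    fifo.push(pkt)
print(fifo.pop_all(), "dropped:", fifo.dropped)   # ['pkt1', 'pkt2'] dropped: 1
```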
|  | ||||
| **sFlow main process**: There's a function running on the _main thread_, which shifts further | ||||
| processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it | ||||
| consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones | ||||
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable) it'll grab the interface counters from those interfaces on which sFlow is turned on.
For both the samples and the counters, VPP produces _Netlink_ messages and sends them to the kernel.
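
A rough Python model of that _sflow-process_ loop might look like the sketch below. The emit
callbacks stand in for the PSAMPLE and USERSOCK Netlink writes; they and the function names are
placeholders of my own, not the plugin's real symbols.

```python
import time
from collections import deque

def sflow_process(worker_fifos, poll_counters, emit_sample, emit_counters,
                  polling_interval=20.0, run_for=2.0):
    """Drain per-worker FIFOs continuously; emit interface counters every polling_interval."""
    stop_at = time.monotonic() + run_for
    next_poll = time.monotonic() + polling_interval
    while time.monotonic() < stop_at:
        for fifo in worker_fifos:
            while fifo:
                emit_sample(fifo.popleft())        # PSAMPLE netlink in the real plugin
        if time.monotonic() >= next_poll:
            emit_counters(poll_counters())         # USERSOCK netlink in the real plugin
            next_poll += polling_interval
        time.sleep(0.01)                           # the real node is driven by VPP's scheduler

# toy run: two worker FIFOs and a 0.5s polling interval
fifos = [deque(["sample-a", "sample-b"]), deque(["sample-c"])]
sflow_process(fifos, poll_counters=lambda: {"Gi10/0/0": {"rx_packets": 123}},
              emit_sample=print, emit_counters=print,
              polling_interval=0.5, run_for=1.0)
```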
|  | ||||
| **host-sflow**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_ | ||||
| messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on | ||||
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy, as
that module already exists. But Neil implemented a _mod_vpp_ module which can grab interface names,
their `ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
| the PSAMPLEs. | ||||
|  | ||||
|  | ||||
| By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref | ||||
| 2021-08-25-vpp-4 >}})] plugin.  It's a mechanism for programs running in userspace to share | ||||
| information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from | ||||
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber queue, and nothing precludes one userspace process (VPP) from being the producer
while another (hsflowd) is the consumer!
|  | ||||
| Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest, | ||||
| giving correctness and upstream interoperability pretty much for free. That's slick! | ||||
|  | ||||
| ### VPP: sFlow Configuration | ||||
|  | ||||
| The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which | ||||
| turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second, | ||||
| the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be | ||||
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|  | ||||
| I can configure VPP in three ways: | ||||
|  | ||||
| ***1. VPP Configuration via CLI*** | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ vppctl | ||||
| vpp0-0# sflow sampling-rate 100 | ||||
| vpp0-0# sflow polling-interval 10 | ||||
| vpp0-0# sflow header-bytes 128 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/0 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/0 disable | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/2 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/3 | ||||
| ``` | ||||
|  | ||||
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100,
which is an unusually high frequency. A production setup may take 1:<linkspeed-in-megabits>, so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate. The second command sets the interface stats polling interval. The default is to gather
these statistics every 20 seconds, but I set it to 10s here. Then, I instruct the plugin how many
bytes of the sampled ethernet frame should be taken. Common values are 64 and 128. I want enough
data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but
the contents of the payload are rarely interesting for statistics purposes. Finally, I can turn on
the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP, an idiomatic way to
turn things on and off is to have an enabler/disabler. It feels a bit clunky maybe to write `sflow
enable $iface disable`, but it makes more logical sense if you parse that as "enable-disable" with
the default being the "enable" operation, and the alternate being the "disable" operation.
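
The 1:<linkspeed-in-megabits> rule of thumb is easy to encode. Here's a tiny, hypothetical helper;
the clamping bounds are my own choice, not something the plugin enforces:

```python
def recommended_sampling_n(link_speed_mbps: int) -> int:
    """Rule of thumb: sample roughly 1:<link speed in megabits>, clamped to sane bounds."""
    return max(100, min(link_speed_mbps, 100_000))

for mbps in (1_000, 10_000, 100_000):
    print(f"{mbps} Mbps -> 1:{recommended_sampling_n(mbps)}")
# 1000 Mbps -> 1:1000, 10000 Mbps -> 1:10000, 100000 Mbps -> 1:100000
```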
|  | ||||
| ***2. VPP Configuration via API*** | ||||
|  | ||||
| I wrote a few API calls for the most common operations. Here's a snippet that shows the same calls | ||||
| as from the CLI above, but in Python APIs: | ||||
|  | ||||
| ```python | ||||
| from vpp_papi import VPPApiClient, VPPApiJSONFiles | ||||
| import sys | ||||
|  | ||||
| vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) | ||||
| vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) | ||||
| vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock") | ||||
| vpp.connect("sflow-api-client") | ||||
| print(vpp.api.show_version().version) | ||||
| # Output: 25.06-rc0~14-g9b1c16039 | ||||
|  | ||||
| vpp.api.sflow_sampling_rate_set(sampling_N=100) | ||||
| print(vpp.api.sflow_sampling_rate_get()) | ||||
| # Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100) | ||||
|  | ||||
| vpp.api.sflow_polling_interval_set(polling_S=10) | ||||
| print(vpp.api.sflow_polling_interval_get()) | ||||
| # Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10) | ||||
|  | ||||
| vpp.api.sflow_header_bytes_set(header_B=128) | ||||
| print(vpp.api.sflow_header_bytes_get()) | ||||
| # Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128) | ||||
|  | ||||
| vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True) | ||||
| vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True) | ||||
| print(vpp.api.sflow_interface_dump()) | ||||
| # Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1), | ||||
| #           sflow_interface_details(_0=667, context=8, hw_if_index=2) ] | ||||
|  | ||||
| print(vpp.api.sflow_interface_dump(hw_if_index=2)) | ||||
| # Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ] | ||||
|  | ||||
| print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index | ||||
| # Output: [] | ||||
|  | ||||
| vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False) | ||||
| print(vpp.api.sflow_interface_dump()) | ||||
| # Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ] | ||||
| ``` | ||||
|  | ||||
| This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get | ||||
| the current value. Then I set the polling interval to 10s and retrieve the current value again. | ||||
| Finally, I set the header bytes to 128, and retrieve the value again.  | ||||
|  | ||||
| Adding and removing interfaces shows the idiom I mentioned before - the API being an | ||||
| `enable_disable()` call of sorts, and typically taking a flag if the operator wants to enable (the | ||||
| default), or disable sFlow on the interface. Getting the list of enabled interfaces can be done with | ||||
| the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details` messages. | ||||
|  | ||||
| I wrote an article showing the Python API and how it works in a fair amount of detail in a | ||||
| [[previous article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you. | ||||
|  | ||||
| ***3. VPPCfg YAML Configuration*** | ||||
|  | ||||
| Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it | ||||
| does not have any form of configuration persistence and that's on purpose. The project is a | ||||
| programmable dataplane, and explicitly has left the programming and configuration as an exercise for | ||||
| integrators. I have written a small Python program that takes a YAML file as input and uses it to | ||||
| configure (and reconfigure, on the fly) the dataplane automatically. It's called | ||||
| [[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)], and I wrote some implementation thoughts on | ||||
| its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2 | ||||
| >}})] so I won't repeat that here. Instead, I will just show the configuration: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ cat << EOF > vppcfg.yaml | ||||
| interfaces: | ||||
|   GigabitEthernet10/0/0: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/1: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/2: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/3: | ||||
|     sflow: true | ||||
|  | ||||
| sflow: | ||||
|   sampling-rate: 100 | ||||
|   polling-interval: 10 | ||||
|   header-bytes: 128 | ||||
| EOF | ||||
| pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp | ||||
| [INFO    ] root.main: Loading configfile vppcfg.yaml | ||||
| [INFO    ] vppcfg.config.valid_config: Configuration validated successfully | ||||
| [INFO    ] root.main: Configuration is valid | ||||
| [INFO    ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp | ||||
| [INFO    ] root.main: Planning succeeded | ||||
| pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp | ||||
| ``` | ||||
|  | ||||
| The slick thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to | ||||
| 1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg | ||||
| apply` stages and the VPP dataplane will reflect the newly declared configuration. | ||||
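
Here's a hedged sketch of that round trip in Python: edit the YAML, re-plan, and push the resulting
commands into VPP with `vppctl exec` as shown earlier (on a real system you may prefer the `vppcfg
apply` stage instead). It assumes the file layout from this article and the PyYAML package; removing
the per-interface `sflow` key is how I choose to disable it here.

```python
import subprocess
import yaml   # PyYAML

with open("vppcfg.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["sflow"]["sampling-rate"] = 1000                            # was 100
cfg["interfaces"]["GigabitEthernet10/0/0"].pop("sflow", None)   # stop sampling on Gi10/0/0

with open("vppcfg.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

subprocess.run(["vppcfg", "plan", "-c", "vppcfg.yaml",
                "-o", "/etc/vpp/config/vppcfg.vpp"], check=True)
subprocess.run(["vppctl", "exec", "/etc/vpp/config/vppcfg.vpp"], check=True)
```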
|  | ||||
| ### hsflowd: Configuration | ||||
|  | ||||
| VPP will start to emit _Netlink_ messages, of type PSAMPLE with packet samples and of type USERSOCK | ||||
| with the custom messages about the interface names and counters. These latter custom messages have | ||||
to be decoded, which is done by the _mod_vpp_ module, starting from release v2.1.11-5
| [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)]. | ||||
|  | ||||
| Here's a minimalist configuration: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ cat /etc/hsflowd.conf | ||||
| sflow { | ||||
|   collector { ip=127.0.0.1 udpport=16343 } | ||||
|   collector { ip=192.0.2.1 namespace=dataplane } | ||||
|   psample { group=1 } | ||||
|   vpp { osIndex=off } | ||||
| } | ||||
| ``` | ||||
|  | ||||
| {{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}} | ||||
|  | ||||
| There are two important details that can be confusing at first: \ | ||||
| **1.** kernel network namespaces \ | ||||
| **2.** interface index namespaces | ||||
|  | ||||
| #### hsflowd: Network namespace | ||||
|  | ||||
| When started by systemd, `hsflowd` and VPP will normally both run in the _default_ network | ||||
| namespace. Network namespaces virtualize Linux's network stack. On creation, a network namespace | ||||
| contains only a loopback interface, and subsequently interfaces can be moved between namespaces. | ||||
| Each network namespace will have its own set of IP addresses, its own routing table, socket listing, | ||||
| connection tracking table, firewall, and other network-related resources.  | ||||
|  | ||||
| Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will | ||||
| naturally do this in the network namespace that its VPP process is running in (the _default_ | ||||
namespace by default). It is therefore important that the recipient of all the Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them both in a
different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.
|  | ||||
| It might pose a problem if the network connectivity lives in a different namespace than the default | ||||
| one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface | ||||
| pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like | ||||
| FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in | ||||
| VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP. | ||||
|  | ||||
| Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in | ||||
| a _different_ (dataplane) namespace to send packets to a collector. This explains the second | ||||
| _collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343 | ||||
| from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace. | ||||
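
For the curious, the underlying mechanism is simply "enter the namespace, create the socket, hop
back". Here's an illustrative Python sketch of that trick (not hsflowd's actual code); it assumes
Python 3.12+ for `os.setns()`, root privileges, and an `ip netns`-style path for the `dataplane`
namespace.

```python
import os
import socket

def udp_socket_in_netns(netns_path="/run/netns/dataplane"):
    """Create a UDP socket that belongs to another network namespace."""
    original = os.open("/proc/self/ns/net", os.O_RDONLY)
    target = os.open(netns_path, os.O_RDONLY)
    try:
        os.setns(target, os.CLONE_NEWNET)                         # hop into 'dataplane'
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # socket lives in that netns
    finally:
        os.setns(original, os.CLONE_NEWNET)                       # hop back to 'default'
        os.close(target)
        os.close(original)
    return sock

# sock = udp_socket_in_netns()
# sock.sendto(b"sFlow datagram goes here", ("192.0.2.1", 6343))
```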
|  | ||||
| #### hsflowd: osIndex | ||||
|  | ||||
| I hope the previous section made some sense, because this one will be a tad more esoteric. When | ||||
| creating a network namespace, each interface will get its own uint32 interface index that identifies | ||||
| it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can | ||||
| (and will!) occur multiple times, once for each namespace. Let me give you an example: | ||||
|  | ||||
| ``` | ||||
| pim@summer:~$ ip link | ||||
| 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ... | ||||
|     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 | ||||
| 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ... | ||||
|     link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff | ||||
|     altname enp1s0f0 | ||||
| 3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ... | ||||
|     link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff | ||||
|     altname enp1s0f1 | ||||
|  | ||||
| pim@summer:~$ ip netns exec dataplane ip link | ||||
| 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ... | ||||
|     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 | ||||
| 2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ... | ||||
|     link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff | ||||
| 3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ... | ||||
|     link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff | ||||
| ``` | ||||
|  | ||||
| I want to draw your attention to the number at the beginning of the line. In the _default_ | ||||
| namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in | ||||
| the _dataplane_ namespace, that index corresponds to a completely different interface called | ||||
| `ifName=xe1-0` (which is link `UP`). | ||||
|  | ||||
| Now, let me show you the interfaces in VPP: | ||||
|  | ||||
| ``` | ||||
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
|               Name               Idx    State  MTU (L3/IP4/IP6/MPLS) | ||||
| GigabitEthernet4/0/0              1      up          9000/0/0/0 | ||||
| GigabitEthernet4/0/1              2     down         9000/0/0/0 | ||||
| GigabitEthernet4/0/2              3     down         9000/0/0/0 | ||||
| GigabitEthernet4/0/3              4     down         9000/0/0/0 | ||||
| TenGigabitEthernet5/0/0           5      up          9216/0/0/0 | ||||
| TenGigabitEthernet5/0/1           6      up          9216/0/0/0 | ||||
| loop0                             7      up          9216/0/0/0 | ||||
| tap0                              19     up          9216/0/0/0 | ||||
| ``` | ||||
|  | ||||
| Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_ | ||||
| (the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to | ||||
| `ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace! | ||||
|  | ||||
| So there are three (relevant) types of namespaces at play here: | ||||
| 1.  ***Linux network*** namespace; here using `dataplane` and `default` each with overlapping numbering | ||||
| 1.  ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP | ||||
|     first attaches to or creates fundamental network interfaces like from DPDK or RDMA, these will | ||||
|     create an _hw_if_index_ in a list. | ||||
| 1.  ***VPP software*** interface namespace. Any interface (including hardware ones!) will also | ||||
|     receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on | ||||
|     GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available | ||||
|     software index (in this example, `sw_if_index=7`). | ||||
|  | ||||
| In Linux CP, I can map one into the other, look at this: | ||||
|  | ||||
| ``` | ||||
| pim@summer:~$ vppctl show lcp | ||||
| lcp default netns dataplane | ||||
| lcp lcp-auto-subint off | ||||
| lcp lcp-sync on | ||||
| lcp lcp-sync-unnumbered on | ||||
| itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane | ||||
| itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane | ||||
| itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane | ||||
| itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane | ||||
| ``` | ||||
|  | ||||
Those `itf-pair` entries describe our _LIPs_, and they hold the coordinates of three things. 1) The VPP
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|  | ||||
| > So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}` | ||||
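
If you ever need this mapping outside of VPP, one pragmatic (if slightly hacky) way is to parse the
`show lcp` output shown above. The field positions in this Python helper are inferred from that
paste, so treat it as illustrative rather than a stable interface:

```python
import re
import subprocess

PAIR_RE = re.compile(
    r"itf-pair: \[\d+\] (?P<vpp_if>\S+) (?P<vpp_tap>\S+) (?P<host_if>\S+) (?P<host_ifindex>\d+) "
    r"type \S+ netns (?P<netns>\S+)")

def lcp_pairs():
    """Return the LIP mapping as parsed from `vppctl show lcp`."""
    out = subprocess.run(["vppctl", "show", "lcp"], capture_output=True, text=True).stdout
    return [m.groupdict() for m in PAIR_RE.finditer(out)]

for lip in lcp_pairs():
    print(f"{lip['vpp_if']} -> {lip['host_if']} "
          f"(ifIndex {lip['host_ifindex']} in netns {lip['netns']})")
```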
|  | ||||
| Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in | ||||
| VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific | ||||
| _hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a | ||||
| choice: should it share with the world the representation of *its* namespace, or should it try to be | ||||
| smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the | ||||
| plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, look up a _LIP_ | ||||
| with it. If it finds one, it'll know both the namespace in which it lives as well as the osIndex in | ||||
| that namespace. If it doesn't, it will at least have the _sw_if_index_ at hand, so it'll annotate the | ||||
| USERSOCK with this information. | ||||
|  | ||||
| Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an | ||||
| implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases | ||||
| relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on` | ||||
| (use Linux ifIndex) or `off` (use VPP sw_if_index). | ||||
|  | ||||
| ### hsflowd: Host Counters | ||||
|  | ||||
Now that I understand the configuration parts of VPP and `hsflowd`, I configure everything,
but I don't turn on any interfaces in VPP yet. Once I start the daemon, I can see that it
sends a UDP packet every 30 seconds to the configured _collector_:
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n | ||||
| tcpdump: verbose output suppressed, use -v[v]... for full protocol decode | ||||
| listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes | ||||
| 15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5, | ||||
|    IPv4 agent 198.19.5.16, agent-id 100000, length 716 | ||||
| ``` | ||||
|  | ||||
| The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these | ||||
| sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets | ||||
| [[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look. | ||||
| Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in | ||||
| [[sflowtool](https://github.com/sflow/sflowtool)]. | ||||
|  | ||||
| I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and | ||||
| it'll tell me what it finds: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sflowtool -p 6343       | ||||
| startDatagram ================================= | ||||
| datagramSourceIP 127.0.0.1 | ||||
| datagramSize 716        | ||||
| unixSecondsUTC 1739112018 | ||||
| localtime 2025-02-09T15:40:18+0100 | ||||
| datagramVersion 5             | ||||
| agentSubId 100000    | ||||
| agent 198.19.5.16         | ||||
| packetSequenceNo 57  | ||||
| sysUpTime 987398    | ||||
| samplesInPacket 1             | ||||
| startSample ---------------------- | ||||
| sampleType_tag 0:4                                                          | ||||
| sampleType COUNTERSSAMPLE | ||||
| sampleSequenceNo 33 | ||||
| sourceId 2:1                 | ||||
| counterBlock_tag 0:2001            | ||||
| adaptor_0_ifIndex 2                                                         | ||||
| adaptor_0_MACs 1    | ||||
| adaptor_0_MAC_0 525400f00100 | ||||
| counterBlock_tag 0:2010 | ||||
| udpInDatagrams 123904 | ||||
| udpNoPorts 23132459 | ||||
| udpInErrors 0 | ||||
| udpOutDatagrams 46480629 | ||||
| udpRcvbufErrors 0 | ||||
| udpSndbufErrors 0 | ||||
| udpInCsumErrors 0 | ||||
| counterBlock_tag 0:2009 | ||||
| tcpRtoAlgorithm 1 | ||||
| tcpRtoMin 200 | ||||
| tcpRtoMax 120000 | ||||
| tcpMaxConn 4294967295 | ||||
| tcpActiveOpens 0 | ||||
| tcpPassiveOpens 30 | ||||
| tcpAttemptFails 0 | ||||
| tcpEstabResets 0 | ||||
| tcpCurrEstab 1 | ||||
| tcpInSegs 89120 | ||||
| tcpOutSegs 86961 | ||||
| tcpRetransSegs 59 | ||||
| tcpInErrs 0 | ||||
| tcpOutRsts 4 | ||||
| tcpInCsumErrors 0 | ||||
| counterBlock_tag 0:2008 | ||||
| icmpInMsgs 23129314 | ||||
| icmpInErrors 32 | ||||
| icmpInDestUnreachs 0 | ||||
| icmpInTimeExcds 23129282 | ||||
| icmpInParamProbs 0 | ||||
| icmpInSrcQuenchs 0 | ||||
| icmpInRedirects 0 | ||||
| icmpInEchos 0 | ||||
| icmpInEchoReps 32 | ||||
| icmpInTimestamps 0 | ||||
| icmpInAddrMasks 0 | ||||
| icmpInAddrMaskReps 0 | ||||
| icmpOutMsgs 0 | ||||
| icmpOutErrors 0 | ||||
| icmpOutDestUnreachs 23132467 | ||||
| icmpOutTimeExcds 0 | ||||
| icmpOutParamProbs 23132467 | ||||
| icmpOutSrcQuenchs 0 | ||||
| icmpOutRedirects 0 | ||||
| icmpOutEchos 0 | ||||
| icmpOutEchoReps 0 | ||||
| icmpOutTimestamps 0 | ||||
| icmpOutTimestampReps 0 | ||||
| icmpOutAddrMasks 0 | ||||
| icmpOutAddrMaskReps 0 | ||||
| counterBlock_tag 0:2007 | ||||
| ipForwarding 2 | ||||
| ipDefaultTTL 64 | ||||
| ipInReceives 46590552 | ||||
| ipInHdrErrors 0 | ||||
| ipInAddrErrors 0 | ||||
| ipForwDatagrams 0 | ||||
| ipInUnknownProtos 0 | ||||
| ipInDiscards 0 | ||||
| ipInDelivers 46402357 | ||||
| ipOutRequests 69613096 | ||||
| ipOutDiscards 0 | ||||
| ipOutNoRoutes 80 | ||||
| ipReasmTimeout 0 | ||||
| ipReasmReqds 0 | ||||
| ipReasmOKs 0 | ||||
| ipReasmFails 0 | ||||
| ipFragOKs 0 | ||||
| ipFragFails 0 | ||||
| ipFragCreates 0 | ||||
| counterBlock_tag 0:2005 | ||||
| disk_total 6253608960 | ||||
| disk_free 2719039488 | ||||
| disk_partition_max_used 56.52 | ||||
| disk_reads 11512 | ||||
| disk_bytes_read 626214912 | ||||
| disk_read_time 48469 | ||||
| disk_writes 1058955 | ||||
| disk_bytes_written 8924332032 | ||||
| disk_write_time 7954804 | ||||
| counterBlock_tag 0:2004 | ||||
| mem_total 8326963200 | ||||
| mem_free 5063872512 | ||||
| mem_shared 0 | ||||
| mem_buffers 86425600 | ||||
| mem_cached 827752448 | ||||
| swap_total 0 | ||||
| swap_free 0 | ||||
| page_in 306365 | ||||
| page_out 4357584 | ||||
| swap_in 0 | ||||
| swap_out 0 | ||||
| counterBlock_tag 0:2003 | ||||
| cpu_load_one 0.030 | ||||
| cpu_load_five 0.050 | ||||
| cpu_load_fifteen 0.040 | ||||
| cpu_proc_run 1 | ||||
| cpu_proc_total 138 | ||||
| cpu_num 2 | ||||
| cpu_speed 1699 | ||||
| cpu_uptime 1699306 | ||||
| cpu_user 64269210 | ||||
| cpu_nice 1810 | ||||
| cpu_system 34690140 | ||||
| cpu_idle 3234293560 | ||||
| cpu_wio 3568580 | ||||
| cpuintr 0 | ||||
| cpu_sintr 5687680 | ||||
| cpuinterrupts 1596621688 | ||||
| cpu_contexts 3246142972 | ||||
| cpu_steal 329520 | ||||
| cpu_guest 0 | ||||
| cpu_guest_nice 0 | ||||
| counterBlock_tag 0:2006 | ||||
| nio_bytes_in 250283 | ||||
| nio_pkts_in 2931 | ||||
| nio_errs_in 0 | ||||
| nio_drops_in 0 | ||||
| nio_bytes_out 370244 | ||||
| nio_pkts_out 1640 | ||||
| nio_errs_out 0 | ||||
| nio_drops_out 0 | ||||
| counterBlock_tag 0:2000 | ||||
| hostname vpp0-0 | ||||
| UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa | ||||
| machine_type 3 | ||||
| os_name 2 | ||||
| os_release 6.1.0-26-amd64 | ||||
| endSample   ---------------------- | ||||
| endDatagram   ================================= | ||||
| ``` | ||||
|  | ||||
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!
|  | ||||
| ### hsflowd: Interface Counters | ||||
|  | ||||
| Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to | ||||
| something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed, | ||||
every ten seconds or so I get a few packets, which I captured in
| [[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Some of the packets contain only one | ||||
| counter record, while others contain more than one (in the PCAP, packet #9 has two). If I update the | ||||
| polling-interval to every second, I can see that most of the packets have four counters. | ||||
|  | ||||
| Those interface counters, as decoded by `sflowtool`, look like this: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \ | ||||
|               awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }' | ||||
| startSample   ---------------------- | ||||
| sampleType_tag 0:4 | ||||
| sampleType COUNTERSSAMPLE | ||||
| sampleSequenceNo 745 | ||||
| sourceId 0:3 | ||||
| counterBlock_tag 0:1005 | ||||
| ifName GigabitEthernet10/0/2 | ||||
| counterBlock_tag 0:1 | ||||
| ifIndex 3 | ||||
| networkType 6 | ||||
| ifSpeed 0 | ||||
| ifDirection 1 | ||||
| ifStatus 3 | ||||
| ifInOctets 858282015 | ||||
| ifInUcastPkts 780540 | ||||
| ifInMulticastPkts 0 | ||||
| ifInBroadcastPkts 0 | ||||
| ifInDiscards 0 | ||||
| ifInErrors 0 | ||||
| ifInUnknownProtos 0 | ||||
| ifOutOctets 1246716016 | ||||
| ifOutUcastPkts 975772 | ||||
| ifOutMulticastPkts 0 | ||||
| ifOutBroadcastPkts 0 | ||||
| ifOutDiscards 127 | ||||
| ifOutErrors 28 | ||||
| ifPromiscuousMode 0 | ||||
| endSample   ---------------------- | ||||
| ``` | ||||
|  | ||||
What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, plus lots of packet and octet counters in both the ingress and egress direction. This is
| super useful for upstream _collectors_, as they can now find the hostname, agent name and address, | ||||
| and the correlation between interface names and their indexes. Noice! | ||||
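
As a quick demonstration of how easy it is to consume this downstream, here's a small Python sketch
that shells out to `sflowtool` and folds its line-oriented output (as shown above) into one dict per
counter sample, keyed by `ifName`. It assumes `sflowtool` is on the PATH and the pcap from this
article is in the current directory.

```python
import subprocess

def counter_samples(pcap="sflow-interface.pcap"):
    """Parse sflowtool's key/value output into {ifName: {field: value}}."""
    out = subprocess.run(["sflowtool", "-r", pcap], capture_output=True, text=True).stdout
    samples, current = {}, {}
    for line in out.splitlines():
        if line.startswith("startSample"):
            current = {}
        elif line.startswith("endSample"):
            if "ifName" in current:                    # keep only interface counter samples
                samples[current["ifName"]] = current
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                current[parts[0]] = parts[1]
    return samples

for name, ctr in counter_samples().items():
    print(name, "in:", ctr.get("ifInOctets"), "out:", ctr.get("ifOutOctets"))
```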
|  | ||||
| #### hsflowd: Packet Samples | ||||
|  | ||||
| Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping | ||||
| the interface polling-interval at 10 seconds and I ask VPP to sample 64 bytes of each packet that it | ||||
| inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I | ||||
now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so they still
contain a host counter record, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are showing me packet samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:
|  | ||||
| {{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}} | ||||
|  | ||||
| Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753 | ||||
| to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as | ||||
| having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to | ||||
| send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It | ||||
| then shows the sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those | ||||
| are sampled.  Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and | ||||
| DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my `iperf3`, | ||||
| booyah! | ||||
|  | ||||
| ### VPP: sFlow Performance | ||||
|  | ||||
| {{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}} | ||||
|  | ||||
| One question I get a lot about this plugin is: what is the performance impact when using | ||||
sFlow? I spent a considerable amount of time tinkering with this, and together with Neil brought
the plugin to what we both agree is the most efficient use of CPU. We could go a bit further, but
| that would require somewhat intrusive changes to VPP's internals and as _North of the Border_ would | ||||
| say: what we have isn't just good, it's good enough! | ||||
|  | ||||
| I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine | ||||
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
| I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same | ||||
| hardware.  I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS | ||||
| forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four | ||||
| interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC | ||||
| (Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off. | ||||
|  | ||||
| To help you reproduce my results, and under the assumption that this is your jam, here's the | ||||
| configuration for all of the kit: | ||||
|  | ||||
| ***0. Cisco T-Rex*** | ||||
| ``` | ||||
| pim@trex:~ $ cat /srv/trex/8x10.yaml | ||||
| - version: 2 | ||||
|   interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ] | ||||
|   port_info: | ||||
|     - src_mac:  00:1b:21:06:00:00 | ||||
|       dest_mac: 9c:69:b4:61:a1:dc    # Connected to Hippo Te3/0/0, purple | ||||
|     - src_mac:  00:1b:21:06:00:01 | ||||
|       dest_mac: 9c:69:b4:61:a1:dd    # Connected to Hippo Te3/0/1, purple | ||||
|     - src_mac:  00:1b:21:83:00:00 | ||||
|       dest_mac: 00:1b:21:83:00:01    # L2XC via Hippo Te3/0/2, cyan | ||||
|     - src_mac:  00:1b:21:83:00:01 | ||||
|       dest_mac: 00:1b:21:83:00:00    # L2XC via Hippo Te3/0/3, cyan | ||||
|  | ||||
|     - src_mac:  00:1b:21:87:00:00 | ||||
|       dest_mac: 9c:69:b4:61:75:d0    # Connected to Hippo Te130/0/0, red | ||||
|     - src_mac:  00:1b:21:87:00:01 | ||||
|       dest_mac: 9c:69:b4:61:75:d1    # Connected to Hippo Te130/0/1, red | ||||
|     - src_mac:  9c:69:b4:85:00:00 | ||||
|       dest_mac: 9c:69:b4:85:00:01    # L2XC via Hippo Te130/0/2, green | ||||
|     - src_mac:  9c:69:b4:85:00:01 | ||||
|       dest_mac: 9c:69:b4:85:00:00    # L2XC via Hippo Te130/0/3, green | ||||
| pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml | ||||
| ``` | ||||
|  | ||||
| When constructing the T-Rex configuration, I specifically set the destination MAC address for L3 | ||||
| circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with | ||||
| `vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On | ||||
| L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at | ||||
| all. It puts its interface in _promiscuous_ mode, and simply writes out any ethernet frame received, | ||||
| directly to the egress interface. | ||||
|  | ||||
| ***1. IPv4*** | ||||
| ``` | ||||
| hippo# set int state TenGigabitEthernet3/0/0 up | ||||
| hippo# set int state TenGigabitEthernet3/0/1 up | ||||
| hippo# set int state TenGigabitEthernet130/0/0 up | ||||
| hippo# set int state TenGigabitEthernet130/0/1 up | ||||
| hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31 | ||||
| hippo# ip route add 16.0.0.0/24 via 100.64.0.0 | ||||
| hippo# ip route add 48.0.0.0/24 via 100.64.1.0 | ||||
| hippo# ip route add 16.0.2.0/24 via 100.64.4.0 | ||||
| hippo# ip route add 48.0.2.0/24 via 100.64.5.0 | ||||
| hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static | ||||
| hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static | ||||
| hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static | ||||
| hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static | ||||
| ``` | ||||
|  | ||||
By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.
|  | ||||
| ***2. MPLS*** | ||||
| ``` | ||||
| hippo# mpls table add 0 | ||||
| hippo# set interface mpls TenGigabitEthernet3/0/0 enable | ||||
| hippo# set interface mpls TenGigabitEthernet3/0/1 enable | ||||
| hippo# set interface mpls TenGigabitEthernet130/0/0 enable | ||||
| hippo# set interface mpls TenGigabitEthernet130/0/1 enable | ||||
| hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17 | ||||
| hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16 | ||||
| hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21 | ||||
| hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20 | ||||
| ``` | ||||
|  | ||||
| Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16 | ||||
| will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which we already know the | ||||
| MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation. | ||||
|  | ||||
| ***3. L2XC*** | ||||
| ``` | ||||
| hippo# set int state TenGigabitEthernet3/0/2 up | ||||
| hippo# set int state TenGigabitEthernet3/0/3 up | ||||
| hippo# set int state TenGigabitEthernet130/0/2 up | ||||
| hippo# set int state TenGigabitEthernet130/0/3 up | ||||
| hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2 | ||||
| ``` | ||||
|  | ||||
| I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive | ||||
| an L2 (ethernet) datagram, and immediately transmit it on another interface. There's no FIB lookup | ||||
and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in-and-out as
| fast as it can! | ||||
|  | ||||
Here's what a loadtest looks like when sending 80Gbps at 256b packets on all eight interfaces:
|  | ||||
| {{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}} | ||||
|  | ||||
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.
|  | ||||
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.
|  | ||||
But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics, comparing the
CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.
|  | ||||
| | Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 | | ||||
| |-------------|-----------|-----------|-----------|-----------|-----------| | ||||
| | L2XC        | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps | | ||||
| | IPv4        | 10.89Mpps | 9.88Mpps  | 9.88Mpps  | 9.84Mpps  | 9.73Mpps  | | ||||
| | MPLS        | 10.11Mpps | 9.52Mpps  | 9.52Mpps  | 9.51Mpps  | 9.45Mpps  | | ||||
| | ***sFlow Packets*** / 10sec | N/A       | 337.42M total | 337.39M total | 336.48M total | 333.64M total | | ||||
| | .. Sampled    |      | 328 | 33.8k | 336k | 3.34M |  | ||||
| | .. Sent       |      | 328 | 33.8k | 336k | 1.53M | | ||||
| | .. Dropped    |      | 0   | 0     | 0    | 1.81M | | ||||
|  | ||||
| Here I can make a few important observations. | ||||
|  | ||||
| **Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which | ||||
| implies that it has a little bit of CPU left over to do other work, if needed.  With IPv4, I can see | ||||
| that the throughput is CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I | ||||
| know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The | ||||
| total capacity is 10.11Mpps when sFlow is turned off. | ||||
|  | ||||
**Overhead**: If I turn on sFlow on an interface, VPP will insert the _sflow-node_ into the
dataplane graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC path, so it could first spend CPU cycles that were still available before
regressing throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS.
|  | ||||
**Sampling Cost**: When doing higher rates of sampling, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
| worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%.  The | ||||
| regression for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS). | ||||
| Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost | ||||
| can be kept well in hand. | ||||
|  | ||||
**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they do not all fit through the FIFO, so the plugin drops samples to protect the downstream
_sflow-process_ on the main thread, and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
| samples made it through.  By the way, this means VPP is happily sending 153K samples/sec to the | ||||
| collector! | ||||
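
A quick sanity check of those numbers, using the packet totals from the table above, shows that the
'Sampled' row is exactly what 1:N sampling predicts:

```python
# ~336.48M packets per 10s window at 1:1'000, ~333.64M packets per 10s window at 1:100
for total_packets, sampling_n in ((336.48e6, 1_000), (333.64e6, 100)):
    expected = total_packets / sampling_n
    print(f"1:{sampling_n}: ~{expected / 1e3:.0f}k samples per 10s window")
```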
|  | ||||
| ## What's Next | ||||
|  | ||||
| Now that I've seen the UDP packets from our agent to a collector on the wire, and also how | ||||
| incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to | ||||
| continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an | ||||
| upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would | ||||
| make the user experience even better (or simpler to understand, at least). | ||||
|  | ||||
| ### Acknowledgements | ||||
|  | ||||
| I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the | ||||
| finer details such as logging, error handling, API specifications, and documentation. He has been a | ||||
| true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably | ||||
| Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in | ||||
| time for the 25.02 release. | ||||