diff --git a/content/articles/2025-02-08-sflow-3.md b/content/articles/2025-02-08-sflow-3.md
index 9b09e6d..0fa94b4 100644
--- a/content/articles/2025-02-08-sflow-3.md
+++ b/content/articles/2025-02-08-sflow-3.md
@@ -38,10 +38,10 @@ and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.t
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to local sFlow Agent.

-**Samples**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
-the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After the
-egress port(s) are found in an L2 FIB, or a next hop (and port) is found after a routing decision,
-the ASIC can annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.
+**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
+the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
+routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
+annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.

**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
@@ -53,27 +53,28 @@ to overstate how important it is to have this so-called _drop monitoring_, as op
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.

**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
-destination of the packet, what additional RIB information do we have (AS path, BGP communities, and
-so on). This may be added to the sample record as well.
+destination of the packet, what additional RIB information is available (AS path, BGP communities,
+and so on). This may be added to the sample record as well.

-**Counters**: Since we're doing sampling of 1:N packets, we can estimate total traffic in a
+**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct [[paper](https://sflow.org/packetSamplingBasics/)]
about the math, so I won't get into that here.
-Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if we
-sample a fraction of the traffic but know how many bytes and packets we saw in total, we can provide
-an overview with a quantifiable accuracy. This is why the Agent will periodically get the interface
-counters from the ASIC.
+Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
+fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
+in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
+periodically get the interface counters from the ASIC.

-**Collector**: One or more samples can be concatenated into UDP messages that go from the Agent to a
-central _sFlow Collector_. The heavy lifting in analysis is done upstream from the switch or router,
-which is great for performance. Many thousands or even tens of thousands of agents can forward
-their samples and interface counters to a single central collector, which in turn can be used to
-draw up a near real time picture of the state of traffic through even the largest of ISP networks or
-datacenter switch fabrics.
+**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
+Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
+switch or router, which is great for performance. Many thousands or even tens of thousands of
+agents can forward their samples and interface counters to a single central collector, which in turn
+can be used to draw up a near real time picture of the state of traffic through even the largest of
+ISP networks or datacenter switch fabrics.

-In sFlow parlance [[VPP](https://fd.io/)] and its companion `hsflowd` is an _Agent_ (it sends the
-UDP packets over the network), and for example the commandline tool `sflowtool` could be a
-_Collector_ (it receives the UDP packets).
+In sFlow parlance [[VPP](https://fd.io/)] and its companion
+[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
+over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
+receives the UDP packets).
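
To make the _Agent_ / _Collector_ split a bit more tangible, here is a minimal sketch of the
receiving end. It is not a real collector: it just binds the customary sFlow port and peeks at the
32-bit version word at the front of each datagram, which is enough to show that the transport really
is plain UDP. The port number and version-field layout are assumptions taken from the sFlow v5 spec
rather than anything VPP-specific, and for actually decoding the samples, `sflowtool` remains the
tool of choice.

```python
import socket
import struct

SFLOW_PORT = 6343  # customary sFlow collector port (assumption; use whatever the Agent sends to)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", SFLOW_PORT))

while True:
    data, (addr, port) = sock.recvfrom(65535)
    if len(data) < 4:
        continue
    # An sFlow v5 datagram starts with a 32-bit datagram version (it should read 5); everything
    # beyond that is best left to a real decoder such as sflowtool.
    (version,) = struct.unpack("!I", data[:4])
    print(f"{len(data)} byte datagram from {addr}:{port}, sFlow version {version}")
```
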

## Recap: sFlow in VPP

@@ -81,13 +82,12 @@ First, I have some pretty good news to report - our work on this plugin was
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
few weeks! Last weekend, I gave a lightning talk at
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
-and caught up with a lot of community members and network- and software engineers. I had a great
-time.
+in Brussels, Belgium, and caught up with a lot of community members and network- and software
+engineers. I had a great time.

-in the dataplane low, we get both high performance, and a smaller probability of bugs causing harm.
-And I do like simple implementations, as they tend to cause less _SIGSEGVs_. The architecture of the
-end to end solution consists of three distinct parts, each with their own risk and performance
-profile:
+In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
+might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
+three distinct parts, each with their own risk and performance profile:

{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}

@@ -104,7 +104,7 @@ get their fair share of samples into the Agent's hands.
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
-configurable), it'll grab all interface counters from those interfaces for which we have sFlow
+configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.

**host-sflow**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
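
Before moving on to configuration, it's worth making the 'quantifiable accuracy' point from the
sFlow recap concrete. The sketch below shows the arithmetic a collector ends up doing with what this
pipeline delivers: scale the 1:N samples back up, and sanity-check the result against the interface
counters that the plugin polls every polling interval. The 196 * sqrt(1/n) rule of thumb is my
paraphrase of the packet sampling paper linked earlier, so treat the exact constant as an assumption
and consult the paper for the real math.

```python
import math

def estimate(sample_frame_lengths: list[int], sampling_rate: int):
    """Scale 1:N packet samples up to estimated totals.

    Each sFlow flow sample carries the original frame length alongside the truncated header, so
    summing those lengths and multiplying by N gives a byte estimate; the packet estimate is simply
    the sample count times N.
    """
    n = len(sample_frame_lengths)
    est_packets = n * sampling_rate
    est_bytes = sum(sample_frame_lengths) * sampling_rate
    # Rule of thumb from the sampling paper (as I recall it): at 95% confidence the relative
    # error is roughly 196 * sqrt(1/n) percent, where n is the number of samples.
    pct_error = 196.0 * math.sqrt(1.0 / n) if n else float("inf")
    return est_packets, est_bytes, pct_error

# 10'000 samples taken at 1:1'000 represent roughly ten million packets, and the estimate should be
# good to about 2%. Comparing est_bytes against the polled interface counters is a cheap way to
# spot a broken or misconfigured sampler.
pkts, octets, err = estimate([1514] * 9_000 + [64] * 1_000, sampling_rate=1_000)
print(f"~{pkts:,} packets, ~{octets:,} bytes, within about {err:.1f}%")
```
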
@@ -132,7 +132,7 @@ turns on sampling at a given rate on physical devices, also known as _hardware-i
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

-I can configure VPP in three ways:
+I will show how to configure VPP in three ways:

***1. VPP Configuration via CLI***
```
@@ -148,22 +148,26 @@ vpp0-0# sflow enable GigabitEthernet10/0/3
```

The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
-which is unusually high frequency. A production setup may take 1: so for a
+which is unusually high frequency. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
-appropriate. The second command sets the interface stats polling interval. The default is to gather
-these statistics every 20 seconds, but I set it to 10s here. Then, I instruct the plugin how many
-bytes of the sampled ethernet frame should be taken. Common values are 64 and 128. I want enough
-data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but
-the contents of the payload are rarely interesting for statistics purposes. Finally, I can turn on
-the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP, an idiomatic way to
-turn on and off things is to have an enabler/disabler. It feels a bit clunky maybe to write `sflow
-enable $iface disable` but it makes more logical sends if you parse that as "enable-disable" with
-the default being the "enable" operation, and the alternate being the "disable" operation.
+appropriate, depending on link load. The second command sets the interface stats polling interval.
+The default is to gather these statistics every 20 seconds, but I set it to 10s here.
+
+Next, I instruct the plugin how many bytes of the sampled ethernet frame should be taken. Common
+values are 64 and 128. I want enough data to see the headers, like MPLS label(s), Dot1Q tag(s), IP
+header and TCP/UDP/ICMP header, but the contents of the payload are rarely interesting for
+statistics purposes.
+
+Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
+an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
+maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
+"enable-disable" with the default being the "enable" operation, and the alternate being the
+"disable" operation.

***2. VPP Configuration via API***

-I wrote a few API calls for the most common operations. Here's a snippet that shows the same calls
-as from the CLI above, but in Python APIs:
+I implemented a few API calls for the most common operations. Here's a snippet that shows the same
+calls as from the CLI above, but using these Python API calls:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
@@ -209,13 +213,14 @@ This short program toys around a bit with the sFlow API. I first set the samplin
the current value. Then I set the polling interval to 10s and retrieve the current value again.
Finally, I set the header bytes to 128, and retrieve the value again.
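
The diff elides most of that Python snippet, so here is a rough reconstruction of what driving the
sFlow API from `vpp_papi` can look like. The connection boilerplate is the usual vpp_papi pattern,
and `sflow_enable_disable` plus `sflow_interface_dump` are the calls discussed below; the setter and
getter message names and their field names, however, are assumptions of mine, so check the generated
`.api.json` files (or the previous vpp-papi article) for the authoritative spelling.

```python
#!/usr/bin/env python3
# Rough sketch of driving the sFlow plugin over VPP's Python API. The sflow_* setter/getter message
# names and their field names are assumptions for illustration; the enable/disable and dump calls
# are the ones described in the text.
import glob

from vpp_papi import VPPApiClient

apifiles = glob.glob("/usr/share/vpp/api/**/*.api.json", recursive=True)
vpp = VPPApiClient(apifiles=apifiles, server_address="/run/vpp/api.sock")
vpp.connect("sflow-example")

# Global settings: sampling rate, polling interval and header bytes.
vpp.api.sflow_sampling_rate_set(sampling_N=100)    # field name assumed
print(vpp.api.sflow_sampling_rate_get())
vpp.api.sflow_polling_interval_set(polling_S=10)   # field name assumed
vpp.api.sflow_header_bytes_set(header_B=128)       # field name assumed

# Enable sFlow on one interface, using the enable/disable idiom described below. The interface
# index would normally be looked up first, for instance via sw_interface_dump(); 1 is a placeholder.
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)

# List the interfaces that currently have sFlow turned on.
for details in vpp.api.sflow_interface_dump():
    print(details)

vpp.disconnect()
```
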
-Adding and removing interfaces shows the idiom I mentioned before - the API being an
-`enable_disable()` call of sorts, and typically taking a flag if the operator wants to enable (the
-default), or disable sFlow on the interface. Getting the list of enabled interfaces can be done with
-the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details` messages.
+Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
+`enable_disable()` call of sorts, and typically taking a boolean argument indicating whether the
+operator wants to enable (the default), or disable sFlow on the interface. Getting the list of
+enabled interfaces can be done with the `sflow_interface_dump()` call, which returns a list of
+`sflow_interface_details` messages.

-I wrote an article showing the Python API and how it works in a fair amount of detail in a
-[[previous article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
+I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
+article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.

***3. VPPCfg YAML Configuration***

@@ -667,9 +672,9 @@ booyah!

One question I get a lot about this plugin is: what is the performance impact when using sFlow? I
spent a considerable amount of time tinkering with this, and together with Neil bringing
-the plugin to what we both agree is the most efficient use of CPU. We could go a bit further, but
-that would require somewhat intrusive changes to VPP's internals and as _North of the Border_ would
-say: what we have isn't just good, it's good enough!
+the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further,
+but that would require somewhat intrusive changes to VPP's internals and as _North of the Border_
+(and the Simpsons!) would say: what we have isn't just good, it's good enough!

I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
@@ -754,7 +759,7 @@ hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-
```

Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
-will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which we already know the
+will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the
MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation.

***3. L2XC***

@@ -812,7 +817,7 @@ know that MPLS is a little bit more expensive computationally than IPv4, and tha
total capacity is 10.11Mpps when sFlow is turned off.

**Overhead**: If I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
-dataplane graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
+forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8% but I have to note that VPP was not CPU
bound on the L2XC so it used some CPU cycles which were still available, before regressing
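
To put those numbers in perspective, here is a back-of-the-envelope calculation with the figures
quoted in this section. The 9.5 cycles/packet cost and the 10.11 Mpps baseline come from the text;
the worker clock frequency is an assumption added purely to make the arithmetic concrete.

```python
# Back-of-the-envelope look at the sflow-node overhead quoted above. The cycle cost and baseline
# throughput come from the text; the 2.2 GHz worker clock is an assumed figure.
CLOCK_HZ = 2.2e9        # assumption: worker core clock
CYCLES_PER_PKT = 9.5    # sflow-node cost per packet (from the text)
BASELINE_PPS = 10.11e6  # L2XC throughput with sFlow turned off (from the text)

cycle_budget = CLOCK_HZ / BASELINE_PPS             # cycles available per packet at that rate
added_ns = CYCLES_PER_PKT / CLOCK_HZ * 1e9
overhead_pct = 100.0 * CYCLES_PER_PKT / cycle_budget

print(f"cycle budget per packet : ~{cycle_budget:.0f} cycles")
print(f"sflow-node adds         : ~{added_ns:.1f} ns per packet")
print(f"worst-case overhead     : ~{overhead_pct:.1f}% of that budget")
# ... which lands in the same ballpark as the measured 3.8% L2XC regression, especially since the
# loadtest was not fully CPU bound to begin with.

# At a production sampling rate of 1:10'000, a 10 Mpps stream only hands the main-thread
# sflow-process about a thousand samples per second to deal with.
print(f"samples/s at 1:10'000   : {10e6 / 10_000:.0f}")
```
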