diff --git a/content/articles/2025-02-08-sflow-3.md b/content/articles/2025-02-08-sflow-3.md
new file mode 100644
index 0000000..9b09e6d
--- /dev/null
+++ b/content/articles/2025-02-08-sflow-3.md
@@ -0,0 +1,849 @@
---
date: "2025-02-08T07:51:23Z"
title: 'VPP with sFlow - Part 3'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}

In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the use of
networks, enabling performance optimization, accounting/billing for usage, and defense against
security threats.

The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for only a
small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
central _Collector_.

In this third article, I wanted to spend some time discussing how samples make their way out of the
VPP dataplane, and into higher level tools.

## Recap: sFlow

{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}

sFlow describes a method for monitoring traffic in switched and routed networks, originally
described in [[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification
is version 5 and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)].
Typically, a Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is
asked to copy 1-in-N packets to a local sFlow Agent.

**Samples**: The agent will copy the first bytes (typically 128) of the packet into a sample. As
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After the
egress port(s) are found in an L2 FIB, or a next hop (and port) is found after a routing decision,
the ASIC can annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.

**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop
the packet, or maybe it even tried to transmit the packet but the physical datalink layer had to
abandon the transmission for whatever reason (link down, TX queue full, link saturation, and so
on). It's hard to overstate how important it is to have this so-called _drop monitoring_, as
operators often spend hours and hours figuring out _why_ packets are lost in their network or
datacenter switching fabric.

**Metadata**: The agent may have other metadata as well, such as the source and destination
prefixes of the packet, and whatever additional RIB information is available (AS path, BGP
communities, and so on). This may be added to the sample record as well.

**Counters**: Since we're sampling 1-in-N packets, we can estimate total traffic in a reasonably
accurate way. Peter and Sonia wrote a succinct [[paper](https://sflow.org/packetSamplingBasics/)]
about the math, so I won't get into that here. Mostly because I am but a software engineer, not a
statistician... :) However, I will say this: if we sample a fraction of the traffic but know how
many bytes and packets we saw in total, we can provide an overview with a quantifiable accuracy.
This is why the Agent will periodically get the interface counters from the ASIC.
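To make that a bit more tangible, here's a back-of-the-envelope sketch in Python. It is not taken
from any sFlow codebase; the `196 * sqrt(1/c)` error bound at 95% confidence is the rule of thumb
from the packet sampling paper referenced above:

```python
# Back-of-the-envelope sketch (not from any sFlow codebase): scale up the
# observed sample count by the sampling rate to estimate totals. The error
# bound of 196*sqrt(1/c) percent at 95% confidence comes from the packet
# sampling paper referenced above.
import math

def estimate(samples_seen: int, sampling_rate: int) -> tuple[float, float]:
    total_packets = samples_seen * sampling_rate       # scaled-up estimate
    pct_error = 196.0 * math.sqrt(1.0 / samples_seen)  # 95% confidence bound
    return total_packets, pct_error

# 10'000 samples at 1:1'000 -> ~10M packets, accurate to within ~2%.
print(estimate(10_000, 1_000))
```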
**Collector**: One or more samples can be concatenated into UDP messages that go from the Agent to
a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the switch or
router, which is great for performance. Many thousands or even tens of thousands of agents can
forward their samples and interface counters to a single central collector, which in turn can be
used to draw up a near real-time picture of the state of traffic through even the largest of ISP
networks or datacenter switch fabrics.

In sFlow parlance, [[VPP](https://fd.io/)] together with its companion `hsflowd` is an _Agent_ (it
sends the UDP packets over the network), and for example the commandline tool `sflowtool` could be
a _Collector_ (it receives the UDP packets).

## Recap: sFlow in VPP

First, I have some pretty good news to report - our work on this plugin was
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
few weeks! Last weekend, I gave a lightning talk at
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
and caught up with a lot of community members and network- and software engineers. I had a great
time.

By keeping the amount of code in the dataplane low, we get both high performance and a smaller
probability of bugs causing harm. And I do like simple implementations, as they tend to cause fewer
_SIGSEGVs_. The architecture of the end to end solution consists of three distinct parts, each with
their own risk and performance profile:

{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}

**sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N packets, taking a sample for further processing. Here, we don't try to be
clever: we simply copy the `inIfIndex` and the first bytes of the ethernet frame, and append them
to a [[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many
samples arrive, samples are dropped at the tail, and a counter is incremented. This way, I can tell
when the dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker
thread to get its fair share of samples into the Agent's hands.
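To illustrate the idiom, here's a toy version of such a bounded FIFO. The real plugin is C inside
VPP with per-worker queues; this Python sketch only demonstrates the behavior: never block the hot
path, tail-drop when full, and count the drops so congestion stays visible:

```python
# Toy sketch of the bounded FIFO idiom described above -- not the plugin's
# actual C implementation.
from collections import deque

class SampleFifo:
    def __init__(self, depth: int):
        self.q = deque()
        self.depth = depth
        self.dropped = 0  # exported as a counter, so congestion is visible

    def push(self, sample: bytes) -> bool:
        """Called in the fast path: O(1), and never blocks."""
        if len(self.q) >= self.depth:
            self.dropped += 1  # tail-drop: the cheapest overload reaction
            return False
        self.q.append(sample)
        return True

    def pop(self) -> bytes | None:
        """Called by the consumer (the main-thread process described below)."""
        return self.q.popleft() if self.q else None
```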
**sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which we have sFlow
turned on. For both of these, VPP produces _Netlink_ messages and sends them to the kernel.

**host-sflow**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE part is easy, as
that module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and
their `ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages
alongside the PSAMPLEs.

By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber queue, and nothing precludes one userspace process (VPP) from being the
producer while another (hsflowd) is the consumer!
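To demystify that channel a little, here's a from-scratch sketch of a PSAMPLE subscriber over a raw
`AF_NETLINK` socket. It resolves the `psample` generic netlink family, joins its `packets`
multicast group, and prints the attributes of whatever arrives. The command and attribute numbers
are my reading of the kernel's `genetlink.h` and `psample.h` at the time of writing, so treat it as
illustrative only - this is exactly what `sflowtool` and `hsflowd` already do for you, far more
robustly:

```python
# Illustrative PSAMPLE subscriber over raw generic netlink (Linux, run as
# root). Constants are from the kernel's genetlink.h / psample.h headers at
# the time of writing -- a sketch, not production code.
import socket, struct

NETLINK_GENERIC, GENL_ID_CTRL, CTRL_CMD_GETFAMILY = 16, 0x10, 3
CTRL_ATTR_FAMILY_ID, CTRL_ATTR_FAMILY_NAME, CTRL_ATTR_MCAST_GROUPS = 1, 2, 7
CTRL_ATTR_MCAST_GRP_NAME, CTRL_ATTR_MCAST_GRP_ID = 1, 2
SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, NLM_F_REQUEST = 270, 1, 1

def nla(atype, payload):  # netlink attribute: u16 len, u16 type, padded body
    return struct.pack("HH", 4 + len(payload), atype) + payload + \
           b"\0" * ((4 - len(payload) % 4) % 4)

def parse(data):          # flat dict of attribute type -> raw payload
    attrs = {}
    while len(data) >= 4:
        alen, atype = struct.unpack_from("HH", data)
        attrs[atype & 0x3FFF] = data[4:alen]
        data = data[(alen + 3) & ~3:]
    return attrs

sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_GENERIC)
sock.bind((0, 0))

# Ask the generic netlink controller where the 'psample' family lives.
payload = struct.pack("BBH", CTRL_CMD_GETFAMILY, 1, 0) + \
          nla(CTRL_ATTR_FAMILY_NAME, b"psample\0")
sock.send(struct.pack("IHHII", 16 + len(payload), GENL_ID_CTRL,
                      NLM_F_REQUEST, 1, 0) + payload)
attrs = parse(sock.recv(65536)[20:])  # skip nlmsghdr (16) + genlmsghdr (4)
group = None
for nested in parse(attrs[CTRL_ATTR_MCAST_GROUPS]).values():
    g = parse(nested)
    if g[CTRL_ATTR_MCAST_GRP_NAME].rstrip(b"\0") == b"packets":
        group = struct.unpack("I", g[CTRL_ATTR_MCAST_GRP_ID])[0]

# Join the 'packets' multicast group and print attributes as they arrive.
sock.setsockopt(SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, group)
PSAMPLE = {0: "iifindex", 1: "oifindex", 2: "origsize", 3: "group",
           4: "group_seq", 5: "rate", 6: "data"}  # from psample.h
while True:
    for atype, val in parse(sock.recv(65536)[20:]).items():
        print(PSAMPLE.get(atype, atype), val.hex())
    print("---")
```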
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the
rest, giving correctness and upstream interoperability pretty much for free. That's slick!

### VPP: sFlow Configuration

The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

I can configure VPP in three ways:

***1. VPP Configuration via CLI***

```
pim@vpp0-0:~$ vppctl
vpp0-0# sflow sampling-rate 100
vpp0-0# sflow polling-interval 10
vpp0-0# sflow header-bytes 128
vpp0-0# sflow enable GigabitEthernet10/0/0
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
vpp0-0# sflow enable GigabitEthernet10/0/2
vpp0-0# sflow enable GigabitEthernet10/0/3
```

The first three commands set the global defaults - in my case I'm going to be sampling at 1:100,
which is an unusually high frequency. A production setup may take 1:_linkspeed-in-megabits_, so for
a 1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate. The second command sets the interface stats polling interval. The default is to gather
these statistics every 20 seconds, but I set it to 10s here. Then, I instruct the plugin how many
bytes of the sampled ethernet frame should be taken. Common values are 64 and 128. I want enough
data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but
the contents of the payload are rarely interesting for statistics purposes. Finally, I can turn on
the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP, an idiomatic way to
turn things on and off is to have an enabler/disabler. It maybe feels a bit clunky to write `sflow
enable $iface disable`, but it makes more logical sense if you parse that as "enable-disable" with
the default being the "enable" operation, and the alternate being the "disable" operation.

***2. VPP Configuration via API***

I wrote a few API calls for the most common operations. Here's a snippet that shows the same calls
as from the CLI above, but in Python APIs:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles

vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-api-client")
print(vpp.api.show_version().version)
# Output: 25.06-rc0~14-g9b1c16039

vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)

vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)

vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
#           sflow_interface_details(_0=667, context=8, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=2))
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
# Output: []

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
```

This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
the current value. Then I set the polling interval to 10s and retrieve the current value again.
Finally, I set the header bytes to 128, and retrieve the value again.

Adding and removing interfaces shows the idiom I mentioned before - the API being an
`enable_disable()` call of sorts, typically taking a flag to say if the operator wants to enable
(the default) or disable sFlow on the interface. Getting the list of enabled interfaces can be done
with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details` messages.

I wrote an article showing the Python API and how it works in a fair amount of detail in a
[[previous article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.

***3. VPPCfg YAML Configuration***

Writing on the CLI and calling the API is all good and well, but many users of VPP have noticed
that it does not have any form of configuration persistence, and that's on purpose. The project is
a programmable dataplane, and has explicitly left the programming and configuration as an exercise
for integrators. I have written a small Python program that takes a YAML file as input and uses it
to configure (and reconfigure, on the fly) the dataplane automatically.
It's called [[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)], and I wrote some implementation
thoughts on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref
2022-04-02-vppcfg-2 >}})] so I won't repeat that here. Instead, I will just show the configuration:

```
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true

sflow:
  sampling-rate: 100
  polling-interval: 10
  header-bytes: 128
EOF
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```

The slick thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow on an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will reflect the newly declared configuration.

### hsflowd: Configuration

VPP will start to emit _Netlink_ messages, of type PSAMPLE with packet samples and of type USERSOCK
with the custom messages about the interface names and counters. These latter custom messages have
to be decoded, which is done by the _mod_vpp_ module, starting from release v2.1.11-5
[[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

Here's a minimalist configuration:

```
pim@vpp0-0:~$ cat /etc/hsflowd.conf
sflow {
  collector { ip=127.0.0.1 udpport=16343 }
  collector { ip=192.0.2.1 namespace=dataplane }
  psample { group=1 }
  vpp { osIndex=off }
}
```

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

There are two important details that can be confusing at first: \
**1.** kernel network namespaces \
**2.** interface index namespaces

#### hsflowd: Network namespace

When started by systemd, `hsflowd` and VPP will normally both run in the _default_ network
namespace. Network namespaces virtualize Linux's network stack. Upon creation, a network namespace
contains only a loopback interface, and subsequently interfaces can be moved between namespaces.
Each network namespace will have its own set of IP addresses, its own routing table, socket
listing, connection tracking table, firewall, and other network-related resources.

Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will naturally do
this in the network namespace that its VPP process is running in (the _default_ namespace, by
default). It is therefore important that the recipient of all the Netlink messages, notably
`hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them both in a
different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.

It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane
interface pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow
something like FRR or Bird to completely govern the routing table in the kernel and keep it in-sync
with the FIB in VPP. In such a _dataplane_ network namespace, typically every interface is owned by
VPP.

Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket
in a _different_ (dataplane) namespace to send packets to a collector. This explains the second
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
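That namespace hop is easy to demonstrate. Below is a small sketch, assuming a netns called
`dataplane` created with `ip netns add` (so its handle exists under `/run/netns`), root privileges,
and Python 3.12 or newer for `os.setns()`. It sends a UDP datagram from inside the dataplane
namespace, much like hsflowd's second collector entry does:

```python
# Sketch: emit a UDP datagram from inside the 'dataplane' netns. Assumes
# `ip netns add dataplane` was used, root privileges, and Python >= 3.12
# for os.setns().
import os
import socket

fd = os.open("/run/netns/dataplane", os.O_RDONLY)
os.setns(fd, os.CLONE_NEWNET)  # hop this process into the dataplane netns
os.close(fd)

# Any socket created from here on lives in the dataplane namespace, and
# uses its routing table and source addresses.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.sendto(b"hello, collector", ("192.0.2.1", 6343))
```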
#### hsflowd: osIndex

I hope the previous section made some sense, because this one will be a tad more esoteric. Within
each network namespace, every interface gets its own uint32 interface index that identifies it, and
such an ID is typically called an `ifIndex`. It's important to note that the same number can (and
will!) occur multiple times, once for each namespace. Let me give you an example:

```
pim@summer:~$ ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: mtu 9000 qdisc mq master ipng-sl state UP ...
    link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
3: eno2: mtu 9000 qdisc mq master ipng-sl state DOWN ...
    link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1

pim@summer:~$ ip netns exec dataplane ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: loop0: mtu 9216 qdisc mq state UP ...
    link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
3: xe1-0: mtu 9216 qdisc mq state UP ...
    link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
```

I want to draw your attention to the number at the beginning of the line. In the _default_
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
the _dataplane_ namespace, that index corresponds to a completely different interface called
`ifName=xe1-0` (which is link `UP`).

Now, let me show you the interfaces in VPP:

```
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)
GigabitEthernet4/0/0              1      up          9000/0/0/0
GigabitEthernet4/0/1              2     down         9000/0/0/0
GigabitEthernet4/0/2              3     down         9000/0/0/0
GigabitEthernet4/0/3              4     down         9000/0/0/0
TenGigabitEthernet5/0/0           5      up          9216/0/0/0
TenGigabitEthernet5/0/1           6      up          9216/0/0/0
loop0                             7      up          9216/0/0/0
tap0                              19     up          9216/0/0/0
```

Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to the hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!

So there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default`, each with its own
   (overlapping) numbering.
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
   first attaches to or creates fundamental network interfaces, like those from DPDK or RDMA, each
   of these gets an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. Any interface (including hardware ones!) will also
   receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
   GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
   software index (in this example, `sw_if_index=7`). The sketch below shows how to list this last
   namespace over the API.
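To poke at that third namespace, here's a minimal sketch using the same `vpp_papi` boilerplate as
the API example earlier in this article; `sw_interface_dump()` is a standard VPP API call:

```python
# Minimal sketch: list VPP's software interface namespace, to contrast with
# the Linux ifIndex numbering from `ip link` above. Same vpp_papi
# connection boilerplate as in the API example earlier in this article.
from vpp_papi import VPPApiClient, VPPApiJSONFiles

vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("ifindex-lister")

for i in vpp.api.sw_interface_dump():
    print(i.sw_if_index, i.interface_name)
# e.g. '3 GigabitEthernet4/0/2' -- the same '3' that means eno2 in the
# default netns and xe1-0 in the dataplane netns. Same number, three
# different meanings.
```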
In Linux CP, I can map one into the other, look at this:

```
pim@summer:~$ vppctl show lcp
lcp default netns dataplane
lcp lcp-auto-subint off
lcp lcp-sync on
lcp lcp-sync-unnumbered on
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
```

Each of those `itf-pair` lines describes a _LIP_, and it holds the coordinates of three things: 1)
the VPP software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror
into the Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has
one leg in VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace
`dataplane`).

> So the tuple that fully describes a _LIP_ is `{7, 19, 'dataplane', 2}`

Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
VPP takes its sample, it will be doing this on a PHY, that is, a given interface with a specific
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has
a choice: should it share with the world the representation of *its* namespace, or should it try to
be smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So
the plugin will first resolve the _sw_if_index_ belonging to that PHY, and use that to look up a
_LIP_. If it finds one, it'll know both the namespace in which the interface lives as well as the
osIndex in that namespace. If it doesn't, it will at least have the _sw_if_index_ at hand, so it'll
annotate the USERSOCK message with that instead.

Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use the Linux ifIndex) or `off` (use the VPP sw_if_index).

### hsflowd: Host Counters

Now that I understand the configuration parts of VPP and `hsflowd`, I configure everything, but I
don't turn on any interfaces in VPP yet. Once I start the daemon, I can see that it sends a UDP
packet every 30 seconds to the configured _collector_:

```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
	IPv4 agent 198.19.5.16, agent-id 100000, length 716
```

The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of
these sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious
packets [[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look. Neil,
however, gives me a tip: a full message decoder and otherwise handy Swiss army knife lives in
[[sflowtool](https://github.com/sflow/sflowtool)].
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
it'll tell me what it finds:

```
pim@vpp0-0:~$ sflowtool -p 6343
startDatagram =================================
datagramSourceIP 127.0.0.1
datagramSize 716
unixSecondsUTC 1739112018
localtime 2025-02-09T15:40:18+0100
datagramVersion 5
agentSubId 100000
agent 198.19.5.16
packetSequenceNo 57
sysUpTime 987398
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 33
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 2
adaptor_0_MACs 1
adaptor_0_MAC_0 525400f00100
counterBlock_tag 0:2010
udpInDatagrams 123904
udpNoPorts 23132459
udpInErrors 0
udpOutDatagrams 46480629
udpRcvbufErrors 0
udpSndbufErrors 0
udpInCsumErrors 0
counterBlock_tag 0:2009
tcpRtoAlgorithm 1
tcpRtoMin 200
tcpRtoMax 120000
tcpMaxConn 4294967295
tcpActiveOpens 0
tcpPassiveOpens 30
tcpAttemptFails 0
tcpEstabResets 0
tcpCurrEstab 1
tcpInSegs 89120
tcpOutSegs 86961
tcpRetransSegs 59
tcpInErrs 0
tcpOutRsts 4
tcpInCsumErrors 0
counterBlock_tag 0:2008
icmpInMsgs 23129314
icmpInErrors 32
icmpInDestUnreachs 0
icmpInTimeExcds 23129282
icmpInParamProbs 0
icmpInSrcQuenchs 0
icmpInRedirects 0
icmpInEchos 0
icmpInEchoReps 32
icmpInTimestamps 0
icmpInAddrMasks 0
icmpInAddrMaskReps 0
icmpOutMsgs 0
icmpOutErrors 0
icmpOutDestUnreachs 23132467
icmpOutTimeExcds 0
icmpOutParamProbs 23132467
icmpOutSrcQuenchs 0
icmpOutRedirects 0
icmpOutEchos 0
icmpOutEchoReps 0
icmpOutTimestamps 0
icmpOutTimestampReps 0
icmpOutAddrMasks 0
icmpOutAddrMaskReps 0
counterBlock_tag 0:2007
ipForwarding 2
ipDefaultTTL 64
ipInReceives 46590552
ipInHdrErrors 0
ipInAddrErrors 0
ipForwDatagrams 0
ipInUnknownProtos 0
ipInDiscards 0
ipInDelivers 46402357
ipOutRequests 69613096
ipOutDiscards 0
ipOutNoRoutes 80
ipReasmTimeout 0
ipReasmReqds 0
ipReasmOKs 0
ipReasmFails 0
ipFragOKs 0
ipFragFails 0
ipFragCreates 0
counterBlock_tag 0:2005
disk_total 6253608960
disk_free 2719039488
disk_partition_max_used 56.52
disk_reads 11512
disk_bytes_read 626214912
disk_read_time 48469
disk_writes 1058955
disk_bytes_written 8924332032
disk_write_time 7954804
counterBlock_tag 0:2004
mem_total 8326963200
mem_free 5063872512
mem_shared 0
mem_buffers 86425600
mem_cached 827752448
swap_total 0
swap_free 0
page_in 306365
page_out 4357584
swap_in 0
swap_out 0
counterBlock_tag 0:2003
cpu_load_one 0.030
cpu_load_five 0.050
cpu_load_fifteen 0.040
cpu_proc_run 1
cpu_proc_total 138
cpu_num 2
cpu_speed 1699
cpu_uptime 1699306
cpu_user 64269210
cpu_nice 1810
cpu_system 34690140
cpu_idle 3234293560
cpu_wio 3568580
cpuintr 0
cpu_sintr 5687680
cpuinterrupts 1596621688
cpu_contexts 3246142972
cpu_steal 329520
cpu_guest 0
cpu_guest_nice 0
counterBlock_tag 0:2006
nio_bytes_in 250283
nio_pkts_in 2931
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 370244
nio_pkts_out 1640
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname vpp0-0
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
machine_type 3
os_name 2
os_release 6.1.0-26-amd64
endSample ----------------------
endDatagram =================================
```

If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_.
There are a lot of vital signs in this packet, including some non-obvious but interesting stats
like CPU load, memory, disk use and disk IO, and kernel version information. It's super dope!

### hsflowd: Interface Counters

Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And
indeed, every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Some of the packets contain only one
counter record, while others contain more than one (in the PCAP, packet #9 has two). If I update
the polling-interval to every second, I can see that most of the packets have four counters.

Those interface counters, as decoded by `sflowtool`, look like this:

```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
  awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 745
sourceId 0:3
counterBlock_tag 0:1005
ifName GigabitEthernet10/0/2
counterBlock_tag 0:1
ifIndex 3
networkType 6
ifSpeed 0
ifDirection 1
ifStatus 3
ifInOctets 858282015
ifInUcastPkts 780540
ifInMulticastPkts 0
ifInBroadcastPkts 0
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 0
ifOutOctets 1246716016
ifOutUcastPkts 975772
ifOutMulticastPkts 0
ifOutBroadcastPkts 0
ifOutDiscards 127
ifOutErrors 28
ifPromiscuousMode 0
endSample ----------------------
```

What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the `ifIndex=3`,
plus lots of packet and octet counters in both the ingress and egress directions. This is super
useful for upstream _collectors_, as they can now find the hostname, agent name and address, and
the correlation between interface names and their indexes. Noice!

### hsflowd: Packet Samples

Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that
it inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some
traffic. I now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so
they still contain a host counter, and every 10 seconds a set of interface counters comes by, but
mostly these UDP packets are showing me samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)]. Although Wireshark doesn't know how to interpret
the sFlow counter messages, it _does_ know how to interpret the sFlow sample messages, and it
reveals one of them like this:

{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}

Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows the sample, saying that the frame length is 1518 bytes, and that the first 64 bytes of
those are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC
and DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my
`iperf3`, booyah!
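Those outer header fields are easy enough to pick apart by hand, too. Here's a small sketch that
decodes just the sFlow v5 datagram header, following the structure in the spec (big-endian; this
toy only handles IPv4 agent addresses). Feeding it the UDP payloads from `sflow-all.pcap` should
reproduce the Wireshark summary above:

```python
# Sketch: decode only the outer sFlow v5 datagram header, matching the
# fields visible in the Wireshark screenshot. Big-endian per the spec;
# only IPv4 agent addresses (address type 1) are handled here.
import struct

def sflow_header(dgram: bytes):
    version, addr_type = struct.unpack_from(">II", dgram, 0)
    assert version == 5 and addr_type == 1
    agent = ".".join(str(b) for b in dgram[8:12])
    sub_agent, seq, uptime_ms, nsamples = struct.unpack_from(">IIII", dgram, 12)
    return agent, sub_agent, seq, uptime_ms, nsamples

# For the datagram in the screenshot, this would return something like
# ('198.19.5.16', 100000, <seq>, <uptime>, 9).
```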
### VPP: sFlow Performance

{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}

One question I get a lot about this plugin is: what is the performance impact when using sFlow? I
spent a considerable amount of time tinkering with this, and together with Neil brought the plugin
to what we both agree is the most efficient use of CPU. We could go a bit further, but that would
require somewhat intrusive changes to VPP's internals, and as _North of the Border_ would say: what
we have isn't just good, it's good enough!

I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all
four interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.

To help you reproduce my results, and under the assumption that this is your jam, here's the
configuration for all of the kit:

***0. Cisco T-Rex***
```
pim@trex:~ $ cat /srv/trex/8x10.yaml
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac:  00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
    - src_mac:  00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
    - src_mac:  00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
    - src_mac:  00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan

    - src_mac:  00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
    - src_mac:  00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
    - src_mac:  9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
    - src_mac:  9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
```

When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
circuits (the purple and red ones) to Hippo's interface MAC address, which I can find with `vppctl
show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On L2XC
circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at all. It
puts its interface in _promiscuous_ mode, and simply writes any ethernet frame it receives directly
to the egress interface.
***1. IPv4***
```
hippo# set int state TenGigabitEthernet3/0/0 up
hippo# set int state TenGigabitEthernet3/0/1 up
hippo# set int state TenGigabitEthernet130/0/0 up
hippo# set int state TenGigabitEthernet130/0/1 up
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex as
well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.

***2. MPLS***
```
hippo# mpls table add 0
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```

Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label
16 will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which we already know
the MAC address), with label 16 removed and new label 17 imposed - in other words, a SWAP
operation.

***3. L2XC***
```
hippo# set int state TenGigabitEthernet3/0/2 up
hippo# set int state TenGigabitEthernet3/0/3 up
hippo# set int state TenGigabitEthernet130/0/2 up
hippo# set int state TenGigabitEthernet130/0/3 up
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```

I've added a layer2 cross connect as well because it's computationally very cheap for VPP to
receive an L2 (ethernet) datagram and immediately transmit it on another interface. There's no FIB
lookup and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in and
out as fast as it can!

Here's what a loadtest looks like when sending 80Gbps at 256b packets on all eight interfaces:

{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}

The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
have the same configuration, but with sFlow turned off. They are my control, showing the CPU use
without sFlow.
These +four ports are my experiment, to show the CPU use of sFlow. Then, ports p3 <-> p4 and p5 <-> p6 +respectively have sFlow turned off but with the same configuration. They are my control, showing +the CPU use without sFLow. + +**First conclusion**: This stuff works a treat. There is absolutely no impact of throughput at +80Gbps with 47.6Mpps either _with_, or _without_ sFlow turned on. That's wonderful news, as it shows +that the dataplane has more CPU available than is needed for any combination of functionality. + +But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics by varying the +CPU time spent and maximum throughput achievable on a single VPP worker, thus using a single CPU +thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit +64 byte ethernet packets, the smallest I'm allowed to send. + +| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 | +|-------------|-----------|-----------|-----------|-----------|-----------| +| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps | +| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps | +| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps | +| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total | +| .. Sampled |   | 328 | 33.8k | 336k | 3.34M | +| .. Sent |   | 328 | 33.8k | 336k | 1.53M | +| .. Dropped |   | 0 | 0 | 0 | 1.81M | + +Here I can make a few important observations. + +**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which +implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see +that the throughput is CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I +know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The +total capacity is 10.11Mpps when sFlow is turned off. + +**Overhead**: If I turn on sFLow on the interface, VPP will insert the _sflow-node_ into the +dataplane graph between `device-input` and `ethernet-input`. It means that the sFlow node will see +_every single_ packet, and it will have to move all of these into the next node, which costs about +9.5 CPU cycles per packet. The regression on L2XC is 3.8% but I have to note that VPP was not CPU +bound on the L2XC so it used some CPU cycles which were still available, before regressing +throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS. + +**Sampling Cost**: When then doing higher rates of sampling, the further regression is not _that_ +terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the +worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The +regression for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS). +Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost +can be kept well in hand. + +**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly +observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but +they are not fitting through the FIFO, so the plugin is dropping samples to protect downstream +`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M +samples made it through. 
## What's Next

Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to continue
the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an upcoming
article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would make the
user experience even better (or simpler to understand, at least).

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
time for the 25.02 release.
diff --git a/static/assets/sflow/sflow-all.pcap b/static/assets/sflow/sflow-all.pcap
new file mode 100644
index 0000000..1c098bc
Binary files /dev/null and b/static/assets/sflow/sflow-all.pcap differ
diff --git a/static/assets/sflow/sflow-host.pcap b/static/assets/sflow/sflow-host.pcap
new file mode 100644
index 0000000..e1998e2
Binary files /dev/null and b/static/assets/sflow/sflow-host.pcap differ
diff --git a/static/assets/sflow/sflow-interface.pcap b/static/assets/sflow/sflow-interface.pcap
new file mode 100644
index 0000000..8853c98
Binary files /dev/null and b/static/assets/sflow/sflow-interface.pcap differ
diff --git a/static/assets/sflow/sflow-lab-trex.png b/static/assets/sflow/sflow-lab-trex.png
new file mode 100644
index 0000000..46d88df
--- /dev/null
+++ b/static/assets/sflow/sflow-lab-trex.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:418875c986818275931b2abdc037cf44b07a18d0a5ffb9836b13f7f4b6a7c721
+size 391520
diff --git a/static/assets/sflow/sflow-lab.png b/static/assets/sflow/sflow-lab.png
new file mode 100644
index 0000000..fc42151
--- /dev/null
+++ b/static/assets/sflow/sflow-lab.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ad73bc6cef0f4c7f127cf32eae1a110e5fdd9c893bae7d44d008ed1384604c86
+size 116247
diff --git a/static/assets/sflow/sflow-overview.png b/static/assets/sflow/sflow-overview.png
new file mode 100644
index 0000000..4f69409
--- /dev/null
+++ b/static/assets/sflow/sflow-overview.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a26f50f2c08c791a78cfe203f803890519538c2bb665b34515a7ea7a0ebfb4bd
+size 89627
diff --git a/static/assets/sflow/sflow-vpp-overview.png b/static/assets/sflow/sflow-vpp-overview.png
new file mode 100644
index 0000000..1331cd4
--- /dev/null
+++ b/static/assets/sflow/sflow-vpp-overview.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c0df740038cdd7550b3b03ae65a68f93e0799d4807ebf4b17512af8ab61dead6
+size 231622
diff --git a/static/assets/sflow/sflow-wireshark.png b/static/assets/sflow/sflow-wireshark.png
new file mode 100644
index 0000000..6ab179e
--- /dev/null
+++ b/static/assets/sflow/sflow-wireshark.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aedc142bb1e22aaccf30daf7f235cf15bec49f3c0f8737680fc0dd13ad05d4c9
+size 335282