| --- | ||||
| date: "2025-02-08T07:51:23Z" | ||||
| title: 'VPP with sFlow - Part 3' | ||||
| --- | ||||
|  | ||||
| # Introduction | ||||
|  | ||||
| {{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}} | ||||
|  | ||||
In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the use of
networks, enabling performance optimization, accounting/billing for usage, and defense against
security threats.
|  | ||||
| The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it | ||||
| forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and | ||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows a small
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in
the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
| feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. | ||||
| This greatly reduces the complexity of code to be implemented in the forwarding path, while at the | ||||
| same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business | ||||
| logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a | ||||
| central _Collector_. | ||||
|  | ||||
| In this third article, I wanted to spend some time discussing how samples make their way out of the | ||||
| VPP dataplane, and into higher level tools. | ||||
|  | ||||
| ## Recap: sFlow | ||||
|  | ||||
| {{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}} | ||||
|  | ||||
| sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in | ||||
| [[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5 | ||||
| and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a | ||||
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to the local sFlow Agent.
|  | ||||
| **Samples**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As | ||||
| the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After the | ||||
| egress port(s) are found in an L2 FIB, or a next hop (and port) is found after a routing decision, | ||||
| the ASIC can annotate the sample with this `outIfIndex` and `DstMAC` metadata as well. | ||||
|  | ||||
| **Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was | ||||
| not_ routed or switched, but rather discarded?  For this, sFlow is able to describe the reason for | ||||
| the drop. For example, the ASIC receive queue could have been overfull, or it did not find a | ||||
| destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the | ||||
| packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the | ||||
| transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard | ||||
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
|  | ||||
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
destination of the packet, and what additional RIB information is available (AS path, BGP
communities, and so on). This may be added to the sample record as well.
|  | ||||
**Counters**: Since we're doing sampling of 1:N packets, we can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if we
| sample a fraction of the traffic but know how many bytes and packets we saw in total, we can provide | ||||
| an overview with a quantifiable accuracy. This is why the Agent will periodically get the interface | ||||
| counters from the ASIC. | ||||
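
To make that concrete, here's a small back-of-the-envelope sketch in Python. The `196 * sqrt(1/c)`
rule of thumb is the 95%-confidence error bound quoted in the sFlow packet sampling material linked
above; the `estimate()` helper itself is mine, purely for illustration.

```python
import math

def estimate(samples_seen: int, sampling_n: int) -> tuple[float, float]:
    """Return (estimated total packets, relative error in % at 95% confidence)."""
    total = samples_seen * sampling_n
    pct_error = 196.0 * math.sqrt(1.0 / samples_seen) if samples_seen else float("inf")
    return total, pct_error

print(estimate(10_000, 1000))   # ~10M packets, accurate to within roughly 2%
```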
|  | ||||
| **Collector**: One or more samples can be concatenated into UDP messages that go from the Agent to a | ||||
| central _sFlow Collector_. The heavy lifting in analysis is done upstream from the switch or router, | ||||
| which is great for performance.  Many thousands or even tens of thousands of agents can forward | ||||
| their samples and interface counters to a single central collector, which in turn can be used to | ||||
| draw up a near real time picture of the state of traffic through even the largest of ISP networks or | ||||
| datacenter switch fabrics. | ||||
|  | ||||
In sFlow parlance, [[VPP](https://fd.io/)] together with its companion `hsflowd` forms an _Agent_
(it sends the UDP packets over the network), and, for example, the commandline tool `sflowtool`
could be a _Collector_ (it receives the UDP packets).
|  | ||||
| ## Recap: sFlow in VPP | ||||
|  | ||||
| First, I have some pretty good news to report - our work on this plugin was | ||||
| [[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a | ||||
| few weeks! Last weekend, I gave a lightning talk at | ||||
| [[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)] | ||||
| and caught up with a lot of community members and network- and software engineers. I had a great | ||||
| time. | ||||
|  | ||||
As a reminder: by keeping the amount of code in the dataplane low, we get both high performance and
a smaller probability of bugs causing harm. And I do like simple implementations, as they tend to
cause fewer _SIGSEGVs_. The architecture of the end-to-end solution consists of three distinct
parts, each with its own risk and performance profile:
|  | ||||
| {{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}} | ||||
|  | ||||
| **sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves | ||||
| packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin | ||||
| will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply | ||||
| copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a | ||||
| [[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples | ||||
arrive, samples are dropped at the tail, and a counter is incremented. This way, I can tell when the
dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker thread to
get its fair share of samples into the Agent's hands.
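
To illustrate the tail-drop behaviour, here's a minimal Python sketch of such a bounded per-worker
FIFO. It's a conceptual model only: the plugin implements this in C inside VPP, and the class name
and queue depth here are my own choices.

```python
from collections import deque

class WorkerSampleFifo:
    """Bounded per-worker sample queue: drop at the tail when full, count the drops."""
    def __init__(self, depth=1024):
        self.q = deque()
        self.depth = depth
        self.dropped = 0

    def push(self, sample) -> bool:
        if len(self.q) >= self.depth:
            self.dropped += 1        # dataplane out-ran the consumer; newest sample is discarded
            return False
        self.q.append(sample)
        return True

    def pop_all(self):
        drained, self.q = list(self.q), deque()   # consumer drains the whole queue in one go
        return drained

fifo = WorkerSampleFifo(depth=2)
for pkt in ("pkt1", "pkt2", "pkt3"):
    fifo.push(pkt)
print(fifo.pop_all(), "dropped:", fifo.dropped)   # ['pkt1', 'pkt2'] dropped: 1
```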
|  | ||||
| **sFlow main process**: There's a function running on the _main thread_, which shifts further | ||||
| processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it | ||||
| consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones | ||||
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable) it'll grab the interface counters from those interfaces on which sFlow is turned on.
For both the samples and the counters, VPP produces _Netlink_ messages and sends them to the kernel.
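
A rough Python model of that _sflow-process_ loop might look like the sketch below. The emit
callbacks stand in for the PSAMPLE and USERSOCK Netlink writes; they and the function names are
placeholders of my own, not the plugin's real symbols.

```python
import time
from collections import deque

def sflow_process(worker_fifos, poll_counters, emit_sample, emit_counters,
                  polling_interval=20.0, run_for=2.0):
    """Drain per-worker FIFOs continuously; emit interface counters every polling_interval."""
    stop_at = time.monotonic() + run_for
    next_poll = time.monotonic() + polling_interval
    while time.monotonic() < stop_at:
        for fifo in worker_fifos:
            while fifo:
                emit_sample(fifo.popleft())        # PSAMPLE netlink in the real plugin
        if time.monotonic() >= next_poll:
            emit_counters(poll_counters())         # USERSOCK netlink in the real plugin
            next_poll += polling_interval
        time.sleep(0.01)                           # the real node is driven by VPP's scheduler

# toy run: two worker FIFOs and a 0.5s polling interval
fifos = [deque(["sample-a", "sample-b"]), deque(["sample-c"])]
sflow_process(fifos, poll_counters=lambda: {"Gi10/0/0": {"rx_packets": 123}},
              emit_sample=print, emit_counters=print,
              polling_interval=0.5, run_for=1.0)
```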
|  | ||||
| **host-sflow**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_ | ||||
| messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on | ||||
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy, as
that module already exists. But Neil implemented a _mod_vpp_ module which can grab interface names,
their `ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
| the PSAMPLEs. | ||||
|  | ||||
|  | ||||
| By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref | ||||
| 2021-08-25-vpp-4 >}})] plugin.  It's a mechanism for programs running in userspace to share | ||||
| information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from | ||||
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber queue, and nothing precludes one userspace process (VPP) from being the producer
while another (hsflowd) is the consumer!
|  | ||||
| Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest, | ||||
| giving correctness and upstream interoperability pretty much for free. That's slick! | ||||
|  | ||||
| ### VPP: sFlow Configuration | ||||
|  | ||||
| The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which | ||||
| turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second, | ||||
| the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be | ||||
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|  | ||||
| I can configure VPP in three ways: | ||||
|  | ||||
| ***1. VPP Configuration via CLI*** | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ vppctl | ||||
| vpp0-0# sflow sampling-rate 100 | ||||
| vpp0-0# sflow polling-interval 10 | ||||
| vpp0-0# sflow header-bytes 128 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/0 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/0 disable | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/2 | ||||
| vpp0-0# sflow enable GigabitEthernet10/0/3 | ||||
| ``` | ||||
|  | ||||
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100,
which is an unusually high frequency. A production setup may take 1:<linkspeed-in-megabits>, so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate. The second command sets the interface stats polling interval. The default is to gather
these statistics every 20 seconds, but I set it to 10s here. Then, I instruct the plugin how many
bytes of the sampled ethernet frame should be taken. Common values are 64 and 128. I want enough
data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but
the contents of the payload are rarely interesting for statistics purposes. Finally, I can turn on
the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP, an idiomatic way to
turn things on and off is to have an enabler/disabler. It feels a bit clunky maybe to write `sflow
enable $iface disable`, but it makes more logical sense if you parse that as "enable-disable" with
the default being the "enable" operation, and the alternate being the "disable" operation.
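
The 1:<linkspeed-in-megabits> rule of thumb is easy to encode. Here's a tiny, hypothetical helper;
the clamping bounds are my own choice, not something the plugin enforces:

```python
def recommended_sampling_n(link_speed_mbps: int) -> int:
    """Rule of thumb: sample roughly 1:<link speed in megabits>, clamped to sane bounds."""
    return max(100, min(link_speed_mbps, 100_000))

for mbps in (1_000, 10_000, 100_000):
    print(f"{mbps} Mbps -> 1:{recommended_sampling_n(mbps)}")
# 1000 Mbps -> 1:1000, 10000 Mbps -> 1:10000, 100000 Mbps -> 1:100000
```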
|  | ||||
| ***2. VPP Configuration via API*** | ||||
|  | ||||
| I wrote a few API calls for the most common operations. Here's a snippet that shows the same calls | ||||
| as from the CLI above, but in Python APIs: | ||||
|  | ||||
| ```python | ||||
| from vpp_papi import VPPApiClient, VPPApiJSONFiles | ||||
| import sys | ||||
|  | ||||
| vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) | ||||
| vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) | ||||
| vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock") | ||||
| vpp.connect("sflow-api-client") | ||||
| print(vpp.api.show_version().version) | ||||
| # Output: 25.06-rc0~14-g9b1c16039 | ||||
|  | ||||
| vpp.api.sflow_sampling_rate_set(sampling_N=100) | ||||
| print(vpp.api.sflow_sampling_rate_get()) | ||||
| # Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100) | ||||
|  | ||||
| vpp.api.sflow_polling_interval_set(polling_S=10) | ||||
| print(vpp.api.sflow_polling_interval_get()) | ||||
| # Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10) | ||||
|  | ||||
| vpp.api.sflow_header_bytes_set(header_B=128) | ||||
| print(vpp.api.sflow_header_bytes_get()) | ||||
| # Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128) | ||||
|  | ||||
| vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True) | ||||
| vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True) | ||||
| print(vpp.api.sflow_interface_dump()) | ||||
| # Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1), | ||||
| #           sflow_interface_details(_0=667, context=8, hw_if_index=2) ] | ||||
|  | ||||
| print(vpp.api.sflow_interface_dump(hw_if_index=2)) | ||||
| # Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ] | ||||
|  | ||||
| print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index | ||||
| # Output: [] | ||||
|  | ||||
| vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False) | ||||
| print(vpp.api.sflow_interface_dump()) | ||||
| # Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ] | ||||
| ``` | ||||
|  | ||||
| This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get | ||||
| the current value. Then I set the polling interval to 10s and retrieve the current value again. | ||||
| Finally, I set the header bytes to 128, and retrieve the value again.  | ||||
|  | ||||
| Adding and removing interfaces shows the idiom I mentioned before - the API being an | ||||
| `enable_disable()` call of sorts, and typically taking a flag if the operator wants to enable (the | ||||
| default), or disable sFlow on the interface. Getting the list of enabled interfaces can be done with | ||||
| the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details` messages. | ||||
|  | ||||
| I wrote an article showing the Python API and how it works in a fair amount of detail in a | ||||
| [[previous article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you. | ||||
|  | ||||
| ***3. VPPCfg YAML Configuration*** | ||||
|  | ||||
| Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it | ||||
| does not have any form of configuration persistence and that's on purpose. The project is a | ||||
| programmable dataplane, and explicitly has left the programming and configuration as an exercise for | ||||
| integrators. I have written a small Python program that takes a YAML file as input and uses it to | ||||
| configure (and reconfigure, on the fly) the dataplane automatically. It's called | ||||
| [[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)], and I wrote some implementation thoughts on | ||||
| its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2 | ||||
| >}})] so I won't repeat that here. Instead, I will just show the configuration: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ cat << EOF > vppcfg.yaml | ||||
| interfaces: | ||||
|   GigabitEthernet10/0/0: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/1: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/2: | ||||
|     sflow: true | ||||
|   GigabitEthernet10/0/3: | ||||
|     sflow: true | ||||
|  | ||||
| sflow: | ||||
|   sampling-rate: 100 | ||||
|   polling-interval: 10 | ||||
|   header-bytes: 128 | ||||
| EOF | ||||
| pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp | ||||
| [INFO    ] root.main: Loading configfile vppcfg.yaml | ||||
| [INFO    ] vppcfg.config.valid_config: Configuration validated successfully | ||||
| [INFO    ] root.main: Configuration is valid | ||||
| [INFO    ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp | ||||
| [INFO    ] root.main: Planning succeeded | ||||
| pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp | ||||
| ``` | ||||
|  | ||||
| The slick thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to | ||||
| 1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg | ||||
| apply` stages and the VPP dataplane will reflect the newly declared configuration. | ||||
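
Here's a hedged sketch of that round trip in Python: edit the YAML, re-plan, and push the resulting
commands into VPP with `vppctl exec` as shown earlier (on a real system you may prefer the `vppcfg
apply` stage instead). It assumes the file layout from this article and the PyYAML package; removing
the per-interface `sflow` key is how I choose to disable it here.

```python
import subprocess
import yaml   # PyYAML

with open("vppcfg.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["sflow"]["sampling-rate"] = 1000                            # was 100
cfg["interfaces"]["GigabitEthernet10/0/0"].pop("sflow", None)   # stop sampling on Gi10/0/0

with open("vppcfg.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

subprocess.run(["vppcfg", "plan", "-c", "vppcfg.yaml",
                "-o", "/etc/vpp/config/vppcfg.vpp"], check=True)
subprocess.run(["vppctl", "exec", "/etc/vpp/config/vppcfg.vpp"], check=True)
```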
|  | ||||
| ### hsflowd: Configuration | ||||
|  | ||||
| VPP will start to emit _Netlink_ messages, of type PSAMPLE with packet samples and of type USERSOCK | ||||
| with the custom messages about the interface names and counters. These latter custom messages have | ||||
to be decoded, which is done by the _mod_vpp_ module, starting from release v2.1.11-5
| [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)]. | ||||
|  | ||||
| Here's a minimalist configuration: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ cat /etc/hsflowd.conf | ||||
| sflow { | ||||
|   collector { ip=127.0.0.1 udpport=16343 } | ||||
|   collector { ip=192.0.2.1 namespace=dataplane } | ||||
|   psample { group=1 } | ||||
|   vpp { osIndex=off } | ||||
| } | ||||
| ``` | ||||
|  | ||||
| {{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}} | ||||
|  | ||||
| There are two important details that can be confusing at first: \ | ||||
| **1.** kernel network namespaces \ | ||||
| **2.** interface index namespaces | ||||
|  | ||||
| #### hsflowd: Network namespace | ||||
|  | ||||
| When started by systemd, `hsflowd` and VPP will normally both run in the _default_ network | ||||
| namespace. Network namespaces virtualize Linux's network stack. On creation, a network namespace | ||||
| contains only a loopback interface, and subsequently interfaces can be moved between namespaces. | ||||
| Each network namespace will have its own set of IP addresses, its own routing table, socket listing, | ||||
| connection tracking table, firewall, and other network-related resources.  | ||||
|  | ||||
| Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will | ||||
| naturally do this in the network namespace that its VPP process is running in (the _default_ | ||||
namespace by default). It is therefore important that the recipient of all the Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them both in a
different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.
|  | ||||
| It might pose a problem if the network connectivity lives in a different namespace than the default | ||||
| one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface | ||||
| pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like | ||||
| FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in | ||||
| VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP. | ||||
|  | ||||
| Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in | ||||
| a _different_ (dataplane) namespace to send packets to a collector. This explains the second | ||||
| _collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343 | ||||
| from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace. | ||||
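
For the curious, the underlying mechanism is simply "enter the namespace, create the socket, hop
back". Here's an illustrative Python sketch of that trick (not hsflowd's actual code); it assumes
Python 3.12+ for `os.setns()`, root privileges, and an `ip netns`-style path for the `dataplane`
namespace.

```python
import os
import socket

def udp_socket_in_netns(netns_path="/run/netns/dataplane"):
    """Create a UDP socket that belongs to another network namespace."""
    original = os.open("/proc/self/ns/net", os.O_RDONLY)
    target = os.open(netns_path, os.O_RDONLY)
    try:
        os.setns(target, os.CLONE_NEWNET)                         # hop into 'dataplane'
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # socket lives in that netns
    finally:
        os.setns(original, os.CLONE_NEWNET)                       # hop back to 'default'
        os.close(target)
        os.close(original)
    return sock

# sock = udp_socket_in_netns()
# sock.sendto(b"sFlow datagram goes here", ("192.0.2.1", 6343))
```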
|  | ||||
| #### hsflowd: osIndex | ||||
|  | ||||
| I hope the previous section made some sense, because this one will be a tad more esoteric. When | ||||
| creating a network namespace, each interface will get its own uint32 interface index that identifies | ||||
| it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can | ||||
| (and will!) occur multiple times, once for each namespace. Let me give you an example: | ||||
|  | ||||
| ``` | ||||
| pim@summer:~$ ip link | ||||
| 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ... | ||||
|     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 | ||||
| 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ... | ||||
|     link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff | ||||
|     altname enp1s0f0 | ||||
| 3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ... | ||||
|     link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff | ||||
|     altname enp1s0f1 | ||||
|  | ||||
| pim@summer:~$ ip netns exec dataplane ip link | ||||
| 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ... | ||||
|     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 | ||||
| 2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ... | ||||
|     link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff | ||||
| 3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ... | ||||
|     link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff | ||||
| ``` | ||||
|  | ||||
| I want to draw your attention to the number at the beginning of the line. In the _default_ | ||||
| namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in | ||||
| the _dataplane_ namespace, that index corresponds to a completely different interface called | ||||
| `ifName=xe1-0` (which is link `UP`). | ||||
|  | ||||
| Now, let me show you the interfaces in VPP: | ||||
|  | ||||
| ``` | ||||
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
|               Name               Idx    State  MTU (L3/IP4/IP6/MPLS) | ||||
| GigabitEthernet4/0/0              1      up          9000/0/0/0 | ||||
| GigabitEthernet4/0/1              2     down         9000/0/0/0 | ||||
| GigabitEthernet4/0/2              3     down         9000/0/0/0 | ||||
| GigabitEthernet4/0/3              4     down         9000/0/0/0 | ||||
| TenGigabitEthernet5/0/0           5      up          9216/0/0/0 | ||||
| TenGigabitEthernet5/0/1           6      up          9216/0/0/0 | ||||
| loop0                             7      up          9216/0/0/0 | ||||
| tap0                              19     up          9216/0/0/0 | ||||
| ``` | ||||
|  | ||||
| Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_ | ||||
| (the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to | ||||
| `ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace! | ||||
|  | ||||
| So there are three (relevant) types of namespaces at play here: | ||||
| 1.  ***Linux network*** namespace; here using `dataplane` and `default` each with overlapping numbering | ||||
| 1.  ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP | ||||
|     first attaches to or creates fundamental network interfaces like from DPDK or RDMA, these will | ||||
|     create an _hw_if_index_ in a list. | ||||
| 1.  ***VPP software*** interface namespace. Any interface (including hardware ones!) will also | ||||
|     receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on | ||||
|     GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available | ||||
|     software index (in this example, `sw_if_index=7`). | ||||
|  | ||||
| In Linux CP, I can map one into the other, look at this: | ||||
|  | ||||
| ``` | ||||
| pim@summer:~$ vppctl show lcp | ||||
| lcp default netns dataplane | ||||
| lcp lcp-auto-subint off | ||||
| lcp lcp-sync on | ||||
| lcp lcp-sync-unnumbered on | ||||
| itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane | ||||
| itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane | ||||
| itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane | ||||
| itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane | ||||
| ``` | ||||
|  | ||||
Those `itf-pair` entries describe our _LIPs_, and they hold the coordinates of three things. 1) The VPP
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|  | ||||
| > So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}` | ||||
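
If you ever need this mapping outside of VPP, one pragmatic (if slightly hacky) way is to parse the
`show lcp` output shown above. The field positions in this Python helper are inferred from that
paste, so treat it as illustrative rather than a stable interface:

```python
import re
import subprocess

PAIR_RE = re.compile(
    r"itf-pair: \[\d+\] (?P<vpp_if>\S+) (?P<vpp_tap>\S+) (?P<host_if>\S+) (?P<host_ifindex>\d+) "
    r"type \S+ netns (?P<netns>\S+)")

def lcp_pairs():
    """Return the LIP mapping as parsed from `vppctl show lcp`."""
    out = subprocess.run(["vppctl", "show", "lcp"], capture_output=True, text=True).stdout
    return [m.groupdict() for m in PAIR_RE.finditer(out)]

for lip in lcp_pairs():
    print(f"{lip['vpp_if']} -> {lip['host_if']} "
          f"(ifIndex {lip['host_ifindex']} in netns {lip['netns']})")
```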
|  | ||||
| Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in | ||||
| VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific | ||||
| _hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a | ||||
| choice: should it share with the world the representation of *its* namespace, or should it try to be | ||||
| smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the | ||||
| plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, look up a _LIP_ | ||||
| with it. If it finds one, it'll know both the namespace in which it lives as well as the osIndex in | ||||
| that namespace. If it doesn't, it will at least have the _sw_if_index_ at hand, so it'll annotate the | ||||
| USERSOCK with this information. | ||||
|  | ||||
| Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an | ||||
| implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases | ||||
| relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on` | ||||
| (use Linux ifIndex) or `off` (use VPP sw_if_index). | ||||
|  | ||||
| ### hsflowd: Host Counters | ||||
|  | ||||
Now that I understand the configuration parts of VPP and `hsflowd`, I configure everything,
but I don't turn on any interfaces in VPP yet. Once I start the daemon, I can see that it
sends a UDP packet every 30 seconds to the configured _collector_:
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n | ||||
| tcpdump: verbose output suppressed, use -v[v]... for full protocol decode | ||||
| listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes | ||||
| 15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5, | ||||
|    IPv4 agent 198.19.5.16, agent-id 100000, length 716 | ||||
| ``` | ||||
|  | ||||
| The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these | ||||
| sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets | ||||
| [[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look. | ||||
| Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in | ||||
| [[sflowtool](https://github.com/sflow/sflowtool)]. | ||||
|  | ||||
| I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and | ||||
| it'll tell me what it finds: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sflowtool -p 6343       | ||||
| startDatagram ================================= | ||||
| datagramSourceIP 127.0.0.1 | ||||
| datagramSize 716        | ||||
| unixSecondsUTC 1739112018 | ||||
| localtime 2025-02-09T15:40:18+0100 | ||||
| datagramVersion 5             | ||||
| agentSubId 100000    | ||||
| agent 198.19.5.16         | ||||
| packetSequenceNo 57  | ||||
| sysUpTime 987398    | ||||
| samplesInPacket 1             | ||||
| startSample ---------------------- | ||||
| sampleType_tag 0:4                                                          | ||||
| sampleType COUNTERSSAMPLE | ||||
| sampleSequenceNo 33 | ||||
| sourceId 2:1                 | ||||
| counterBlock_tag 0:2001            | ||||
| adaptor_0_ifIndex 2                                                         | ||||
| adaptor_0_MACs 1    | ||||
| adaptor_0_MAC_0 525400f00100 | ||||
| counterBlock_tag 0:2010 | ||||
| udpInDatagrams 123904 | ||||
| udpNoPorts 23132459 | ||||
| udpInErrors 0 | ||||
| udpOutDatagrams 46480629 | ||||
| udpRcvbufErrors 0 | ||||
| udpSndbufErrors 0 | ||||
| udpInCsumErrors 0 | ||||
| counterBlock_tag 0:2009 | ||||
| tcpRtoAlgorithm 1 | ||||
| tcpRtoMin 200 | ||||
| tcpRtoMax 120000 | ||||
| tcpMaxConn 4294967295 | ||||
| tcpActiveOpens 0 | ||||
| tcpPassiveOpens 30 | ||||
| tcpAttemptFails 0 | ||||
| tcpEstabResets 0 | ||||
| tcpCurrEstab 1 | ||||
| tcpInSegs 89120 | ||||
| tcpOutSegs 86961 | ||||
| tcpRetransSegs 59 | ||||
| tcpInErrs 0 | ||||
| tcpOutRsts 4 | ||||
| tcpInCsumErrors 0 | ||||
| counterBlock_tag 0:2008 | ||||
| icmpInMsgs 23129314 | ||||
| icmpInErrors 32 | ||||
| icmpInDestUnreachs 0 | ||||
| icmpInTimeExcds 23129282 | ||||
| icmpInParamProbs 0 | ||||
| icmpInSrcQuenchs 0 | ||||
| icmpInRedirects 0 | ||||
| icmpInEchos 0 | ||||
| icmpInEchoReps 32 | ||||
| icmpInTimestamps 0 | ||||
| icmpInAddrMasks 0 | ||||
| icmpInAddrMaskReps 0 | ||||
| icmpOutMsgs 0 | ||||
| icmpOutErrors 0 | ||||
| icmpOutDestUnreachs 23132467 | ||||
| icmpOutTimeExcds 0 | ||||
| icmpOutParamProbs 23132467 | ||||
| icmpOutSrcQuenchs 0 | ||||
| icmpOutRedirects 0 | ||||
| icmpOutEchos 0 | ||||
| icmpOutEchoReps 0 | ||||
| icmpOutTimestamps 0 | ||||
| icmpOutTimestampReps 0 | ||||
| icmpOutAddrMasks 0 | ||||
| icmpOutAddrMaskReps 0 | ||||
| counterBlock_tag 0:2007 | ||||
| ipForwarding 2 | ||||
| ipDefaultTTL 64 | ||||
| ipInReceives 46590552 | ||||
| ipInHdrErrors 0 | ||||
| ipInAddrErrors 0 | ||||
| ipForwDatagrams 0 | ||||
| ipInUnknownProtos 0 | ||||
| ipInDiscards 0 | ||||
| ipInDelivers 46402357 | ||||
| ipOutRequests 69613096 | ||||
| ipOutDiscards 0 | ||||
| ipOutNoRoutes 80 | ||||
| ipReasmTimeout 0 | ||||
| ipReasmReqds 0 | ||||
| ipReasmOKs 0 | ||||
| ipReasmFails 0 | ||||
| ipFragOKs 0 | ||||
| ipFragFails 0 | ||||
| ipFragCreates 0 | ||||
| counterBlock_tag 0:2005 | ||||
| disk_total 6253608960 | ||||
| disk_free 2719039488 | ||||
| disk_partition_max_used 56.52 | ||||
| disk_reads 11512 | ||||
| disk_bytes_read 626214912 | ||||
| disk_read_time 48469 | ||||
| disk_writes 1058955 | ||||
| disk_bytes_written 8924332032 | ||||
| disk_write_time 7954804 | ||||
| counterBlock_tag 0:2004 | ||||
| mem_total 8326963200 | ||||
| mem_free 5063872512 | ||||
| mem_shared 0 | ||||
| mem_buffers 86425600 | ||||
| mem_cached 827752448 | ||||
| swap_total 0 | ||||
| swap_free 0 | ||||
| page_in 306365 | ||||
| page_out 4357584 | ||||
| swap_in 0 | ||||
| swap_out 0 | ||||
| counterBlock_tag 0:2003 | ||||
| cpu_load_one 0.030 | ||||
| cpu_load_five 0.050 | ||||
| cpu_load_fifteen 0.040 | ||||
| cpu_proc_run 1 | ||||
| cpu_proc_total 138 | ||||
| cpu_num 2 | ||||
| cpu_speed 1699 | ||||
| cpu_uptime 1699306 | ||||
| cpu_user 64269210 | ||||
| cpu_nice 1810 | ||||
| cpu_system 34690140 | ||||
| cpu_idle 3234293560 | ||||
| cpu_wio 3568580 | ||||
| cpuintr 0 | ||||
| cpu_sintr 5687680 | ||||
| cpuinterrupts 1596621688 | ||||
| cpu_contexts 3246142972 | ||||
| cpu_steal 329520 | ||||
| cpu_guest 0 | ||||
| cpu_guest_nice 0 | ||||
| counterBlock_tag 0:2006 | ||||
| nio_bytes_in 250283 | ||||
| nio_pkts_in 2931 | ||||
| nio_errs_in 0 | ||||
| nio_drops_in 0 | ||||
| nio_bytes_out 370244 | ||||
| nio_pkts_out 1640 | ||||
| nio_errs_out 0 | ||||
| nio_drops_out 0 | ||||
| counterBlock_tag 0:2000 | ||||
| hostname vpp0-0 | ||||
| UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa | ||||
| machine_type 3 | ||||
| os_name 2 | ||||
| os_release 6.1.0-26-amd64 | ||||
| endSample   ---------------------- | ||||
| endDatagram   ================================= | ||||
| ``` | ||||
|  | ||||
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!
|  | ||||
| ### hsflowd: Interface Counters | ||||
|  | ||||
| Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to | ||||
| something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed, | ||||
every ten seconds or so I get a few packets, which I captured in
| [[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Some of the packets contain only one | ||||
| counter record, while others contain more than one (in the PCAP, packet #9 has two). If I update the | ||||
| polling-interval to every second, I can see that most of the packets have four counters. | ||||
|  | ||||
| Those interface counters, as decoded by `sflowtool`, look like this: | ||||
|  | ||||
| ``` | ||||
| pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \ | ||||
|               awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }' | ||||
| startSample   ---------------------- | ||||
| sampleType_tag 0:4 | ||||
| sampleType COUNTERSSAMPLE | ||||
| sampleSequenceNo 745 | ||||
| sourceId 0:3 | ||||
| counterBlock_tag 0:1005 | ||||
| ifName GigabitEthernet10/0/2 | ||||
| counterBlock_tag 0:1 | ||||
| ifIndex 3 | ||||
| networkType 6 | ||||
| ifSpeed 0 | ||||
| ifDirection 1 | ||||
| ifStatus 3 | ||||
| ifInOctets 858282015 | ||||
| ifInUcastPkts 780540 | ||||
| ifInMulticastPkts 0 | ||||
| ifInBroadcastPkts 0 | ||||
| ifInDiscards 0 | ||||
| ifInErrors 0 | ||||
| ifInUnknownProtos 0 | ||||
| ifOutOctets 1246716016 | ||||
| ifOutUcastPkts 975772 | ||||
| ifOutMulticastPkts 0 | ||||
| ifOutBroadcastPkts 0 | ||||
| ifOutDiscards 127 | ||||
| ifOutErrors 28 | ||||
| ifPromiscuousMode 0 | ||||
| endSample   ---------------------- | ||||
| ``` | ||||
|  | ||||
What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, plus lots of packet and octet counters in both the ingress and egress direction. This is
| super useful for upstream _collectors_, as they can now find the hostname, agent name and address, | ||||
| and the correlation between interface names and their indexes. Noice! | ||||
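
As a quick demonstration of how easy it is to consume this downstream, here's a small Python sketch
that shells out to `sflowtool` and folds its line-oriented output (as shown above) into one dict per
counter sample, keyed by `ifName`. It assumes `sflowtool` is on the PATH and the pcap from this
article is in the current directory.

```python
import subprocess

def counter_samples(pcap="sflow-interface.pcap"):
    """Parse sflowtool's key/value output into {ifName: {field: value}}."""
    out = subprocess.run(["sflowtool", "-r", pcap], capture_output=True, text=True).stdout
    samples, current = {}, {}
    for line in out.splitlines():
        if line.startswith("startSample"):
            current = {}
        elif line.startswith("endSample"):
            if "ifName" in current:                    # keep only interface counter samples
                samples[current["ifName"]] = current
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                current[parts[0]] = parts[1]
    return samples

for name, ctr in counter_samples().items():
    print(name, "in:", ctr.get("ifInOctets"), "out:", ctr.get("ifOutOctets"))
```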
|  | ||||
| #### hsflowd: Packet Samples | ||||
|  | ||||
| Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping | ||||
| the interface polling-interval at 10 seconds and I ask VPP to sample 64 bytes of each packet that it | ||||
| inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I | ||||
now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so they still
contain a host counter record, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are showing me packet samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:
|  | ||||
| {{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}} | ||||
|  | ||||
| Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753 | ||||
| to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as | ||||
| having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to | ||||
| send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It | ||||
| then shows the sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those | ||||
| are sampled.  Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and | ||||
| DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my `iperf3`, | ||||
| booyah! | ||||
|  | ||||
| ### VPP: sFlow Performance | ||||
|  | ||||
| {{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}} | ||||
|  | ||||
| One question I get a lot about this plugin is: what is the performance impact when using | ||||
sFlow? I spent a considerable amount of time tinkering with this, and together with Neil brought
the plugin to what we both agree is the most efficient use of CPU. We could go a bit further, but
| that would require somewhat intrusive changes to VPP's internals and as _North of the Border_ would | ||||
| say: what we have isn't just good, it's good enough! | ||||
|  | ||||
| I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine | ||||
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
| I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same | ||||
| hardware.  I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS | ||||
| forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four | ||||
| interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC | ||||
| (Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off. | ||||
|  | ||||
| To help you reproduce my results, and under the assumption that this is your jam, here's the | ||||
| configuration for all of the kit: | ||||
|  | ||||
| ***0. Cisco T-Rex*** | ||||
| ``` | ||||
| pim@trex:~ $ cat /srv/trex/8x10.yaml | ||||
| - version: 2 | ||||
|   interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ] | ||||
|   port_info: | ||||
|     - src_mac:  00:1b:21:06:00:00 | ||||
|       dest_mac: 9c:69:b4:61:a1:dc    # Connected to Hippo Te3/0/0, purple | ||||
|     - src_mac:  00:1b:21:06:00:01 | ||||
|       dest_mac: 9c:69:b4:61:a1:dd    # Connected to Hippo Te3/0/1, purple | ||||
|     - src_mac:  00:1b:21:83:00:00 | ||||
|       dest_mac: 00:1b:21:83:00:01    # L2XC via Hippo Te3/0/2, cyan | ||||
|     - src_mac:  00:1b:21:83:00:01 | ||||
|       dest_mac: 00:1b:21:83:00:00    # L2XC via Hippo Te3/0/3, cyan | ||||
|  | ||||
|     - src_mac:  00:1b:21:87:00:00 | ||||
|       dest_mac: 9c:69:b4:61:75:d0    # Connected to Hippo Te130/0/0, red | ||||
|     - src_mac:  00:1b:21:87:00:01 | ||||
|       dest_mac: 9c:69:b4:61:75:d1    # Connected to Hippo Te130/0/1, red | ||||
|     - src_mac:  9c:69:b4:85:00:00 | ||||
|       dest_mac: 9c:69:b4:85:00:01    # L2XC via Hippo Te130/0/2, green | ||||
|     - src_mac:  9c:69:b4:85:00:01 | ||||
|       dest_mac: 9c:69:b4:85:00:00    # L2XC via Hippo Te130/0/3, green | ||||
| pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml | ||||
| ``` | ||||
|  | ||||
| When constructing the T-Rex configuration, I specifically set the destination MAC address for L3 | ||||
| circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with | ||||
| `vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On | ||||
| L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at | ||||
| all. It puts its interface in _promiscuous_ mode, and simply writes out any ethernet frame received, | ||||
| directly to the egress interface. | ||||
|  | ||||
| ***1. IPv4*** | ||||
| ``` | ||||
| hippo# set int state TenGigabitEthernet3/0/0 up | ||||
| hippo# set int state TenGigabitEthernet3/0/1 up | ||||
| hippo# set int state TenGigabitEthernet130/0/0 up | ||||
| hippo# set int state TenGigabitEthernet130/0/1 up | ||||
| hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31 | ||||
| hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31 | ||||
| hippo# ip route add 16.0.0.0/24 via 100.64.0.0 | ||||
| hippo# ip route add 48.0.0.0/24 via 100.64.1.0 | ||||
| hippo# ip route add 16.0.2.0/24 via 100.64.4.0 | ||||
| hippo# ip route add 48.0.2.0/24 via 100.64.5.0 | ||||
| hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static | ||||
| hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static | ||||
| hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static | ||||
| hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static | ||||
| ``` | ||||
|  | ||||
By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.
|  | ||||
| ***2. MPLS*** | ||||
| ``` | ||||
| hippo# mpls table add 0 | ||||
| hippo# set interface mpls TenGigabitEthernet3/0/0 enable | ||||
| hippo# set interface mpls TenGigabitEthernet3/0/1 enable | ||||
| hippo# set interface mpls TenGigabitEthernet130/0/0 enable | ||||
| hippo# set interface mpls TenGigabitEthernet130/0/1 enable | ||||
| hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17 | ||||
| hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16 | ||||
| hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21 | ||||
| hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20 | ||||
| ``` | ||||
|  | ||||
| Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16 | ||||
| will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which we already know the | ||||
| MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation. | ||||
|  | ||||
| ***3. L2XC*** | ||||
| ``` | ||||
| hippo# set int state TenGigabitEthernet3/0/2 up | ||||
| hippo# set int state TenGigabitEthernet3/0/3 up | ||||
| hippo# set int state TenGigabitEthernet130/0/2 up | ||||
| hippo# set int state TenGigabitEthernet130/0/3 up | ||||
| hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3 | ||||
| hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2 | ||||
| ``` | ||||
|  | ||||
| I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive | ||||
| an L2 (ethernet) datagram, and immediately transmit it on another interface. There's no FIB lookup | ||||
and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in-and-out as
| fast as it can! | ||||
|  | ||||
Here's what a loadtest looks like when sending 80Gbps at 256b packets on all eight interfaces:
|  | ||||
| {{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}} | ||||
|  | ||||
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.
|  | ||||
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.
|  | ||||
But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics, comparing the
CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.
|  | ||||
| | Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 | | ||||
| |-------------|-----------|-----------|-----------|-----------|-----------| | ||||
| | L2XC        | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps | | ||||
| | IPv4        | 10.89Mpps | 9.88Mpps  | 9.88Mpps  | 9.84Mpps  | 9.73Mpps  | | ||||
| | MPLS        | 10.11Mpps | 9.52Mpps  | 9.52Mpps  | 9.51Mpps  | 9.45Mpps  | | ||||
| | ***sFlow Packets*** / 10sec | N/A       | 337.42M total | 337.39M total | 336.48M total | 333.64M total | | ||||
| | .. Sampled    |      | 328 | 33.8k | 336k | 3.34M |  | ||||
| | .. Sent       |      | 328 | 33.8k | 336k | 1.53M | | ||||
| | .. Dropped    |      | 0   | 0     | 0    | 1.81M | | ||||
|  | ||||
| Here I can make a few important observations. | ||||
|  | ||||
| **Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which | ||||
| implies that it has a little bit of CPU left over to do other work, if needed.  With IPv4, I can see | ||||
| that the throughput is CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I | ||||
| know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The | ||||
| total capacity is 10.11Mpps when sFlow is turned off. | ||||
|  | ||||
**Overhead**: If I turn on sFlow on an interface, VPP will insert the _sflow-node_ into the
dataplane graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC path, so it could first spend CPU cycles that were still available before
regressing throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS.
|  | ||||
**Sampling Cost**: When doing higher rates of sampling, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
| worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%.  The | ||||
| regression for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS). | ||||
| Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost | ||||
| can be kept well in hand. | ||||
|  | ||||
**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they do not all fit through the FIFO, so the plugin drops samples to protect the downstream
_sflow-process_ on the main thread, and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
| samples made it through.  By the way, this means VPP is happily sending 153K samples/sec to the | ||||
| collector! | ||||
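
A quick sanity check of those numbers, using the packet totals from the table above, shows that the
'Sampled' row is exactly what 1:N sampling predicts:

```python
# ~336.48M packets per 10s window at 1:1'000, ~333.64M packets per 10s window at 1:100
for total_packets, sampling_n in ((336.48e6, 1_000), (333.64e6, 100)):
    expected = total_packets / sampling_n
    print(f"1:{sampling_n}: ~{expected / 1e3:.0f}k samples per 10s window")
```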
|  | ||||
| ## What's Next | ||||
|  | ||||
| Now that I've seen the UDP packets from our agent to a collector on the wire, and also how | ||||
| incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to | ||||
| continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an | ||||
| upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would | ||||
| make the user experience even better (or simpler to understand, at least). | ||||
|  | ||||
| ### Acknowledgements | ||||
|  | ||||
| I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the | ||||
| finer details such as logging, error handling, API specifications, and documentation. He has been a | ||||
| true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably | ||||
| Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in | ||||
| time for the 25.02 release. | ||||