few weeks! Last weekend, I gave a lightning talk in Brussels, Belgium, and caught up with a lot of
community members and network- and software engineers. I had a great time.
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
might impact VPP's dataplane stability low, the architecture of the end-to-end solution consists of
three distinct parts, each with their own risk and performance profile:
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever: we
simply copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a FIFO
queue. If too many samples arrive, samples are dropped at the tail, and a counter incremented. This
way, I can tell if the dataplane is congested. Bounded FIFOs also provide fairness: each VPP worker
thread gets its fair share of samples into the Agent's hands.
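To make the FIFO mechanics concrete, here is a small conceptual model in Python. It only illustrates the idea described above; the class, the depth and the sample format are mine, not the plugin's actual (C) implementation inside VPP.

```python
# Conceptual model of a per-worker bounded FIFO with tail drop. The depth chosen
# here is arbitrary and for illustration only.
from collections import deque

class WorkerSampleFifo:
    def __init__(self, depth: int = 1024):
        self.q = deque()
        self.depth = depth
        self.dropped = 0                   # incremented on every tail drop

    def push(self, if_index: int, header: bytes) -> None:
        if len(self.q) >= self.depth:
            self.dropped += 1              # FIFO full: drop at the tail
            return
        self.q.append((if_index, header))  # inIfIndex + first bytes of the frame

    def drain(self):
        while self.q:
            yield self.q.popleft()         # consumed later by the main process
```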
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
the PSAMPLEs.
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) from being the
producer while another userspace process (`hsflowd`) acts as the consumer!
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!
```
vpp0-0# sflow enable GigabitEthernet10/0/3
```
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate, depending on link load. The second command sets the interface stats polling interval.
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
values are 64 and 128, but it doesn't have to be a power of two. I want enough data to see the
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
the payload are rarely interesting for
statistics purposes.
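As a quick back-of-the-envelope check on why 64 bytes can be tight and 128 is comfortable, here is the arithmetic with the standard fixed header sizes (no IP or TCP options); the exact mix of tags and labels is of course traffic dependent:

```python
# Header sizes in bytes (fixed parts only, no IP/TCP options).
ETH = 14      # Ethernet II
DOT1Q = 4     # per 802.1Q tag
MPLS = 4      # per MPLS label
IPV4 = 20     # IPv4 header
TCP = 20      # TCP header

plain = ETH + IPV4 + TCP                           # 54: fits in 64 bytes
stacked = ETH + 2 * DOT1Q + 2 * MPLS + IPV4 + TCP  # 70: wants the 128 byte option
print(plain, stacked)
```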
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI.
***2. VPP Configuration via API***
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
same config as what I typed on the CLI above, but using these Python API calls:
```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
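
# The rest of the call sequence, sketched out. NOTE: the sFlow method and field
# names below (sampling_N, polling_S, header_B, hw_if_index) are assumptions on
# my side; check the plugin's generated .api.json for the authoritative names.

vpp_json_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_json_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-demo")

# Global defaults: sampling rate, polling interval and header bytes, reading
# back the current value after each set.
vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())

# Turn on sFlow on one interface, looking up its index by name first.
ifaces = {i.interface_name: i.sw_if_index for i in vpp.api.sw_interface_dump()}
vpp.api.sflow_enable_disable(hw_if_index=ifaces["GigabitEthernet10/0/3"],
                             enable_disable=True)

# Which interfaces have sFlow enabled?
for details in vpp.api.sflow_interface_dump():
    print(details)

vpp.disconnect()
```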
First, I set the sampling rate and retrieve the current value. Then I set the polling interval to
10s and retrieve the current value.
Finally, I set the header bytes to 128, and retrieve the value again.
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
messages.
I've written a dedicated [[article]({{< ref 2024-01-27-vpp-papi >}})] about the VPP Python API, in
case this type of stuff interests you.
***3. VPPCfg YAML Configuration***
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
programmable dataplane, and it explicitly leaves the programming and configuration as an exercise for
integrators. I have written a Python project that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically, called
[[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)]. Previously, I wrote some implementation thoughts
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:
```
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
### hsflowd: Configuration
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
and of type USERSOCK with the custom messages containing interface names and counters. These latter
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
Here's a minimalist configuration:
There are two important details that can be confusing at first:
#### hsflowd: Network namespace
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
and VPP will normally both run in the _default_ network namespace.
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
naturally do this in the network namespace that its VPP process is running in (the _default_
namespace, normally). It is therefore important that the recipient of these Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together
in a different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each
other.
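A quick way to double-check this on a running system is to compare the two daemons' network namespaces via `/proc`. The little helper below is mine, not part of VPP or `hsflowd`, and it assumes both processes can be found with `pidof`:

```python
# Compare the network namespace of two processes: /proc/<pid>/ns/net resolves to
# the same (device, inode) pair if and only if they share a network namespace.
import os
import subprocess

def netns_id(pid: int) -> tuple:
    st = os.stat(f"/proc/{pid}/ns/net")
    return (st.st_dev, st.st_ino)

vpp_pid = int(subprocess.check_output(["pidof", "vpp"]).split()[0])
hsflowd_pid = int(subprocess.check_output(["pidof", "hsflowd"]).split()[0])
print("same netns:", netns_id(vpp_pid) == netns_id(hsflowd_pid))
```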
It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
pairs in a separate `dataplane` network namespace.
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to the hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
It turns out that there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
(and overlapping) numbering.
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
each receive an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
software index (in this example, `sw_if_index=7`).
In Linux CP, I can see a mapping from one to the other, just look at this:
```
pim@summer:~$ vppctl show lcp
```
When VPP takes its sample, it will be doing this on a PHY, that is a given interface and
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
choice: should it share with the world the representation of *its* namespace, or should it try to be
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
plugin will first resolve the _sw_if_index_ belonging to that PHY and, using that, try to look up a
_LIP_. If it finds one, it'll know both the namespace in which it lives as well as the
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
hand, so it'll annotate the USERSOCK counter messages with this information instead.
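In other words, the lookup the plugin performs can be modeled like this. The Python below is purely a conceptual sketch of the logic just described; the dictionaries, the `Lip` type and the example values are made up for illustration and are not the plugin's real data structures:

```python
# Conceptual sketch of the PHY -> sw_if_index -> LIP resolution described above.
from typing import NamedTuple, Optional

class Lip(NamedTuple):
    netns: str       # e.g. "dataplane"
    os_index: int    # the Linux ifIndex inside that namespace

def annotate_counters(hw_if_index: int, hw_to_sw: dict, lips: dict) -> dict:
    sw_if_index = hw_to_sw[hw_if_index]      # resolve the PHY's sw_if_index
    lip: Optional[Lip] = lips.get(sw_if_index)
    if lip is not None:                      # a LIP exists: share Linux's view
        return {"netns": lip.netns, "osIndex": lip.os_index,
                "sw_if_index": sw_if_index}
    return {"sw_if_index": sw_if_index}      # no LIP: fall back to VPP's index

# Illustrative values only:
print(annotate_counters(1, {1: 3}, {3: Lip("dataplane", 2)}))
```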
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
### hsflowd: Host Counters
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
it sends a UDP packet every 30 seconds to the configured _collector_:
```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
endSample ----------------------
endDatagram =================================
```
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might I
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!
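If you'd like to peek at these datagrams without `sflowtool`, a few lines of Python can act as a stand-in collector. The header fields decoded below (version, agent address, sequence number, uptime, sample count) follow my reading of the sFlow v5 datagram layout and assume an IPv4 agent address, so treat it as a sketch rather than a full decoder:

```python
# Minimal stand-in collector: listen on UDP 6343 and decode the sFlow v5
# datagram header (assumes an IPv4 agent address, as in the captures above).
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 6343))

while True:
    data, peer = sock.recvfrom(9000)
    version, addr_type = struct.unpack_from("!II", data, 0)
    agent = socket.inet_ntoa(data[8:12]) if addr_type == 1 else "non-IPv4"
    sub_agent, seq, uptime_ms, num_samples = struct.unpack_from("!IIII", data, 12)
    print(f"from {peer}: sFlow v{version} agent={agent} seq={seq} "
          f"uptime={uptime_ms / 1000:.0f}s samples={num_samples}")
```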
Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M, so that effectively no packets get sampled), and the interface
polling-interval to every 10 seconds. And indeed,
every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.
Those interface counters, as decoded by `sflowtool`, look like this:
```
endSample ----------------------
```
What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1), which contains the
`ifIndex=3`, and lots of packet and octet counters both in the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!
Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!
### VPP: sFlow Performance
For the L2XC (cross connect) case, VPP will receive an L2 (ethernet) datagram, and immediately
transmit it on another interface. There is no L3 lookup and not even an L2 nexthop lookup involved;
VPP is just shoveling ethernet packets in-and-out as
fast as it can!
Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:
{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p0 <-> p2 are sending L2XC traffic
back and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p3 <-> p4 and p5 <-> p6
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
Here I can make a few important observations.
**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which
implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see
that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU
thread). I know that MPLS is a little bit more expensive computationally than IPv4, and that checks
out. The total MPLS capacity is 10.11Mpps for one worker, when sFlow is turned off.
**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. This means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC test, so it used some CPU cycles which were still available before regressing
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the
packets through the graph.
**Sampling Cost**: When sampling at higher rates, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
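As a quick sanity check on that worst-case L2XC number, using only the throughputs quoted above:

```python
# L2XC at 1:100 sampling, from the figures quoted above.
baseline_mpps = 14.32   # sFlow enabled, before heavy sampling
sampled_mpps = 14.15    # sFlow enabled, sampling at 1:100
regression = (baseline_mpps - sampled_mpps) / baseline_mpps
print(f"{regression:.1%}")  # ~1.2%
```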
Even at the highest sampling rates, the cost can be kept well in hand.
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they are not fitting through the FIFO, so the plugin is dropping samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
to the collector!
## What's Next