few weeks! Last weekend, I gave a lightning talk in Brussels, Belgium, and caught up with a lot of
community members and network- and software engineers. I had a great time.

In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
three distinct parts, each with their own risk and performance profile:

{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}

**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
FIFO queue. If too many samples
arrive, samples are dropped at the tail, and a counter incremented. This way, I know when the
dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker thread to
get its fair share of samples into the Agent's hands.
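
To make that concrete, here's a minimal Python sketch of the bounded tail-drop FIFO idea (the
plugin itself is C inside VPP; the names here are illustrative only):

```python
from collections import deque

class BoundedSampleFifo:
    """Illustrative per-worker FIFO: bounded, tail-drop, with a drop counter."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.samples = deque()
        self.dropped = 0  # bumped instead of ever blocking the dataplane

    def push(self, sample) -> bool:
        if len(self.samples) >= self.capacity:
            self.dropped += 1  # tail drop: congestion shows up right here
            return False
        self.samples.append(sample)
        return True

    def pop(self):
        return self.samples.popleft() if self.samples else None

# One FIFO per worker thread bounds memory use and gives every worker an
# equal, independent path toward the consumer on the main thread.
fifos = [BoundedSampleFifo() for _ in range(4)]
```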

**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.

**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
the PSAMPLEs.

By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber relationship, and nothing precludes one userspace process (`vpp`) from being the
producer while another userspace process (`hsflowd`) acts as the consumer!

Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!

***1. VPP Configuration via CLI***

```
vpp0-0# sflow enable GigabitEthernet10/0/3
```

The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_, so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate, depending on link load. The second command sets the interface stats polling interval.
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
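
To make that rule of thumb concrete, a tiny helper (my own illustration, not part of VPP):

```python
def rule_of_thumb_sampling_rate(link_speed_mbps: int) -> int:
    # Sample 1-in-<linkspeed-in-megabits>, as suggested above.
    return max(1, link_speed_mbps)

print(rule_of_thumb_sampling_rate(1_000))    # 1Gbps  -> 1:1'000
print(rule_of_thumb_sampling_rate(100_000))  # 100GE  -> 1:100'000
```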

Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
the payload are rarely interesting for statistics purposes.
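
A quick back-of-the-envelope check of why 128 bytes is usually plenty, using standard header sizes
(my arithmetic, not from the plugin):

```python
eth, dot1q, mpls, ipv4, ipv6, tcp = 14, 4, 4, 20, 40, 20

v4_stack = eth + dot1q + mpls + ipv4 + tcp  # 62 bytes: just fits in 64
v6_stack = eth + dot1q + mpls + ipv6 + tcp  # 82 bytes: needs header-bytes 128
print(v4_stack, v6_stack)
```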

Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
this enable/disable idiom is common: one CLI turns a feature both on and off. It feels a bit clunky
maybe to write `sflow enable $iface disable` but it makes more logical sense if you look at the API.

***2. VPP Configuration via API***

I implemented a few API methods for the most common operations. Here's a snippet that obtains the
same config as what I typed on the CLI above, but using these Python API calls:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
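# The rest of the snippet is sketched here from the description below; the
# sflow_* method and argument names are assumptions, not verified against
# the plugin's .api definitions.
apidir = VPPApiJSONFiles.find_api_dir([])
apifiles = VPPApiJSONFiles.find_api_files(api_dir=apidir)
vpp = VPPApiClient(apifiles=apifiles, server_address="/run/vpp/api.sock")
vpp.connect("sflow-demo")

print(vpp.api.sflow_sampling_rate_set(sampling_N=100))   # assumed name/args
print(vpp.api.sflow_sampling_rate_get())
print(vpp.api.sflow_polling_interval_set(polling_S=10))  # assumed name/args
print(vpp.api.sflow_header_bytes_set(header_B=128))      # assumed name/args
print(vpp.api.sflow_header_bytes_get())
```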

The snippet connects to VPP, sets the sampling rate to 100, and reads back
the current value. Then I set the polling interval to 10s and retrieve the current value.
Finally, I set the header bytes to 128, and retrieve the value again.

Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
messages.
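
A sketch of those two calls, with the same caveat that the argument names are assumptions:

```python
# Continuing with the connected `vpp` client from the snippet above.
print(vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True))  # assumed args

for intf in vpp.api.sflow_interface_dump():
    print(intf)  # sflow_interface_details messages
```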

I wrote more about VPP's Python API in a previous
[[article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.

***3. VPPCfg YAML Configuration***

Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
programmable dataplane, and it has explicitly left the programming and configuration as an exercise
for integrators. I have written a Python project that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically, called
[[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)]. Previously, I wrote some implementation thoughts
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:
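
A minimal sketch of the shape of that YAML (the `sflow` keys here are illustrative; check the
`vppcfg` schema for the authoritative spelling):

```yaml
# Illustrative only: the sFlow stanza's key names may differ in vppcfg.
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
```

Planning and applying it then follows the usual `vppcfg` cycle: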

```
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```

The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.

### hsflowd: Configuration

When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
and of type USERSOCK with the custom messages containing interface names and counters. These latter
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

Here's a minimalist configuration:
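
Something along these lines; the `collector` and `psample` stanzas are standard `hsflowd.conf`,
while the exact `vpp` module options may differ per version (see the `osIndex` discussion below):

```
sflow {
  collector { ip=127.0.0.1 udpport=6343 }  # where to send sFlow datagrams
  psample { group=1 }                      # consume PSAMPLE packet samples
  vpp { osIndex=off }                      # mod_vpp; see the osIndex choice below
}
```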

There are two important details that can be confusing at first:

#### hsflowd: Network namespace

Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
and VPP will normally both run in the _default_ network namespace.

Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
naturally do this in the network namespace that its VPP process is running in (the _default_
namespace, normally). It is therefore important that the recipient of these Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
a different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.
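
For example, assuming VPP runs in a namespace called `dataplane`, one way to co-locate them for a
quick test (`-d` keeps `hsflowd` in the foreground), with a sketch of the systemd equivalent:

```
pim@vpp0-0:~$ sudo ip netns exec dataplane hsflowd -d

# or, persistently, via an override for the hsflowd systemd unit:
# [Service]
# NetworkNamespacePath=/var/run/netns/dataplane
```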

It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
pairs in a separate `dataplane` namespace.

Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!

It turns out that there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default`, each with their own unique
   (and overlapping) numbering.
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
   first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
   each get an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
   receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
   GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
   software index (in this example, `sw_if_index=7`).

In Linux CP, I can see a mapping from one to the other, just look at this:

```
pim@summer:~$ vppctl show lcp
```
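
The same mapping is also visible programmatically. A small sketch using `vpp_papi` and the stock
`sw_interface_dump()` call (connection boilerplate as in the API snippet earlier):

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles

# Connect to a local VPP over its API socket.
apidir = VPPApiJSONFiles.find_api_dir([])
apifiles = VPPApiJSONFiles.find_api_files(api_dir=apidir)
vpp = VPPApiClient(apifiles=apifiles, server_address="/run/vpp/api.sock")
vpp.connect("sw-if-index-demo")

# VPP's software interface namespace: sw_if_index -> interface name.
for intf in vpp.api.sw_interface_dump():
    print(intf.sw_if_index, intf.interface_name)

vpp.disconnect()
```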

When VPP takes its sample, it will be doing this on a PHY, that is a given interface with a
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
choice: should it share with the world the representation of *its* namespace, or should it try to be
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
hand, so it'll annotate the USERSOCK counter messages with this information instead.

Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).

### hsflowd: Host Counters

Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
it sends a UDP packet every 30 seconds to the configured _collector_:

```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
endSample ----------------------
endDatagram =================================
```

If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including
some non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel
version information. It's super dope!

Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M, so samples will be rare), and the interface polling-interval to
every 10 seconds. And indeed, every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.

Those interface counters, as decoded by `sflowtool`, look like this:

```
endSample ----------------------
```

What I find particularly cool about it, is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1), which contains the
`ifIndex=3` along with lots of packet and octet counters in both the ingress and egress direction.
This is super useful for upstream _collectors_, as they can now find the hostname, agent name and
address, and the correlation between interface names and their indexes. Noice!

Let me take a look at the picture from top to bottom. First, the outer header (from the agent
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!

### VPP: sFlow Performance

In L2XC mode, VPP will take
an L2 (ethernet) datagram, and immediately transmit it on another interface. There is no FIB lookup
and not even an L2 nexthop lookup involved, VPP is just shoveling ethernet packets in-and-out as
fast as it can!

Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:

{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}

The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p0 <-> p2 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p3 <-> p4 and p5 <-> p6
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.

**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps either _with_, or _without_ sFlow turned on. That's wonderful news, as it shows
that enabling sFlow does not cost any forwarding throughput at these rates.

Here I can make a few important observations.

**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned
off, which implies that it has a little bit of CPU left over to do other work, if needed. With IPv4,
I can see that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus,
one CPU thread). I know that MPLS is a little bit more expensive computationally than IPv4, and that
checks out. The total capacity is 10.11Mpps for one worker, when sFlow is turned off.

**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8% but I have to note that VPP was not CPU
bound on the L2XC so it used some CPU cycles which were still available, before regressing
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the
packets through the graph.
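
As a rough sanity check of that 9.5 cycles/packet figure (the clock speed is my assumption for
illustration; the article doesn't state it here):

```python
clock_hz = 2.5e9              # assumed core clock, for illustration only
ipv4_pps = 10.89e6            # IPv4 baseline from above
budget = clock_hz / ipv4_pps  # cycles available per packet on one core
print(f"{budget:.0f} cycles/pkt budget; "
      f"sflow pass-through: 9.5 cycles = {9.5 / budget:.1%}")
# ~230 cycles/pkt; 9.5 cycles is ~4%, the same order as the measured 9.3%
```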

**Sampling Cost**: But when doing higher rates of sampling, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8%
(MPLS), so the cost of sampling itself can be kept well in hand.

**FIFO**: At a sampling rate of 1:1'000, I
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they are not fitting through the FIFO, so the plugin is dropping samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
to the collector!

## What's Next