---
date: "2024-09-08T12:51:23Z"
title: 'VPP with sFlow - Part 1'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called Ciprian reached out to me after seeing my [[DENOG #14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called [[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)] which is able to emit IPFIX records. Unfortunately I never really got it to work well in my tests, as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite a few fixes to flowprobe, but it remains a computationally expensive operation. Wouldn't copying one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?

In the months that followed, I discussed the feature with the incredible folks at [[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)], but for lack of time I didn't manage to get it to work, which, by the way, was largely my fault.

However, I have a bit of time on my hands in September and October, and just a few weeks ago, my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
## sFlow: Protocol

Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.

sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device. On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP communities, and what-not). Finally, the Agent will periodically read the octet and packet counters of the physical network interface(s). Ultimately, the Agent will send the samples and additional information over the network as a UDP datagram, to an **sFlow Collector** for further processing.

sFlow has been specifically designed to take advantage of the statistical properties of packet sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic monitoring system will always produce statistically quantifiable measurements. You can read more about it in Peter Phaal and Sonia Panchen's [[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a little bit at the math :)
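
The two things I took away from that paper, in napkin form (my paraphrase, so blame me and not the authors for the exact constant): the Collector scales each sample count back up by the sampling rate, and the relative error shrinks with the square root of the number of samples taken:

```
EstimatedPackets = Samples * SamplingRate
PercentError    <= 196 * sqrt(1 / Samples)        (at 95% confidence)

 1'000 samples at 1:1000 -> estimate  1'000'000 packets, error <= 196*sqrt(1/1000)  ~ 6.2%
10'000 samples at 1:1000 -> estimate 10'000'000 packets, error <= 196*sqrt(1/10000) ~ 2.0%
```

In other words: the busier the link, the faster the estimates become accurate, regardless of the total traffic volume.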
### sFlow: Netlink PSAMPLE

sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be done in hardware, but there also exist several software implementations. One very clever thing, I think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to samples of a certain _group_. The PSAMPLE API uses [[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's [[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
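
To get a feel for that decoupling without any VPP involved yet: the kernel's own `tc` _sample_ action is one existing PSAMPLE producer. A quick sketch, assuming an interface called `enp1s0`:

```
# Producer: sample 1 in 1000 ingress packets on enp1s0 into PSAMPLE group 1
sudo modprobe psample
sudo tc qdisc add dev enp1s0 clsact
sudo tc filter add dev enp1s0 ingress matchall action sample rate 1000 group 1

# Consumer: anything subscribed to group 1, for example psampletest (see below)
```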
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples from the physical network interfaces, and producing Netlink messages. Then, some other program, notably outside of VPP, can consume these messages and further handle them, creating UDP packets with sFlow samples and counters and other information, and sending them to an **sFlow Collector** somewhere else on the network.

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}

There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a kernel module that needs to be loaded: `modprobe psample`, and `psampletest` helpfully does that for you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out hair.
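
In concrete terms, that boils down to:

```
sudo modprobe psample
lsmod | grep psample
echo psample | sudo tee -a /etc/modules
```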
## VPP: sFlow Plugin

For the purposes of my initial testing, I'll simply take a look at Neil's prototype on [[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and performance.

### sFlow Plugin: Anatomy

The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after `device-input`. In other words, if enabled, the plugin will see all packets that are read from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the packet, and if it's not selected for sampling, just move it onwards to the next node, typically `ethernet-input`. Almost all of the interesting action is in `node.c`.

The kicker is that one in N packets will be selected for sampling (a simplified sketch follows the list below), after which:

1. the ethernet header (`*en`) is extracted from the packet
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works with physical network interfaces!
1. if there are already too many outstanding samples from this worker thread, the new sample is discarded and an error counter is incremented. This protects the main thread from being slammed with samples if there are simply too many being fished out of the dataplane.
1. Otherwise:
   * a new `sflow_sample_t` is created, with all the sampling process metadata filled in
   * the first 128 bytes of the packet are copied into the sample
   * an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
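
To make that a bit more tangible, here's my own much-simplified sketch of the per-packet decision in plain C. To be clear, this is an illustration and not the plugin's actual code; the names (`sflow_sample_t`, `SFLOW_HDR_BYTES`) are made up for the example:

```c
#include <stdint.h>
#include <string.h>

#define SFLOW_HDR_BYTES 128

typedef struct {
  uint32_t hw_if_index;   /* physical input interface, VPP's numbering */
  uint32_t sampling_N;    /* the 1-in-N rate in force when sampled */
  uint32_t header_bytes;  /* how much of the packet was copied */
  uint8_t  header[SFLOW_HDR_BYTES];
} sflow_sample_t;

/* Returns 1 if this packet was selected for sampling. *skip_counter must be
 * initialized to sampling_N; a real implementation would randomize the skip
 * to avoid phase-locking with periodic traffic. */
static int
maybe_sample (uint32_t *skip_counter, uint32_t sampling_N, uint32_t hw_if_index,
              const uint8_t *pkt, uint32_t pkt_len, sflow_sample_t *out)
{
  if (--(*skip_counter) > 0)
    return 0;                       /* fast path: pass the packet on untouched */
  *skip_counter = sampling_N;       /* re-arm for the next sample */

  out->hw_if_index  = hw_if_index;
  out->sampling_N   = sampling_N;
  out->header_bytes = pkt_len < SFLOW_HDR_BYTES ? pkt_len : SFLOW_HDR_BYTES;
  memcpy (out->header, pkt, out->header_bytes);
  return 1;                         /* caller hands this off, eg. as an RPC to main */
}
```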
Both a debug CLI command and API call are added:

```
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
```

Some observations:

First off, the sampling_N in Neil's demo is a global rather than a per-interface setting. It would make more sense for this to be per-interface, as routers typically have a mixture of 1G/10G and faster 100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to 1:10000 and then saw the first interface change its sampling rate as well. It's a small thing, and will not be an issue to change.

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many samples are sent from many threads, there will be lock contention and performance will suffer.
### sFlow Plugin: Functional

I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it, making sure the `psample` kernel module is loaded. In this first test I'll take a look at the table stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each of the four VPP routers. For reference, the Lab looks like this:

{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
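
As an aside, enabling the plugin in `startup.conf` boils down to something like the following sketch. I'm going from memory on the exact shared object name (`sflow_plugin.so`), so double check it against your build:

```
plugins {
  plugin sflow_plugin.so { enable }
}
```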
What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:

```
pim@vpp0-3:~ $ iperf3 -s -D
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
```
### Configuring VPP for sFlow

While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a sampling rate of 1:1000 as these interfaces are GigabitEthernet:

```
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
Name                 State         Calls      Vectors    Suspends     Clocks    Vectors/Call
sflow                active         5656        24168           0     9.01e2            4.27
```

Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its business. But is it sending data? To answer this question, I can first take a look at the `psampletest` tool:
```
root@vpp0-2:~# psampletest
pstest: modprobe psample returned 0
pstest: netlink socket number = 1637
pstest: getFamily
pstest: generic netlink CMD = 1
pstest: generic family name: psample
pstest: generic family id: 32
pstest: psample attr type: 4 (nested=0) len: 8
pstest: psample attr type: 5 (nested=0) len: 8
pstest: psample attr type: 6 (nested=0) len: 24
pstest: psample multicast group id: 9
pstest: psample multicast group: config
pstest: psample multicast group id: 10
pstest: psample multicast group: packets
pstest: psample found group packets=10
pstest: joinGroup 10
pstest: received Netlink ACK
pstest: joinGroup 10
pstest: set headers...
pstest: serialize...
pstest: print before sending...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
pstest: send...
pstest: send_psample getuid=0 geteuid=0
pstest: sendmsg returned 140
pstest: free...
pstest: start read loop...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
```
I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push 100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can expect two or three samples per second. I immediately notice a few things:

***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace. Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in Linux CP, often times the controlplane interfaces are created in a dedicated `dataplane` network namespace.
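
In my setup VPP runs in the default netns, so plain `psampletest` works. If your consumer needs to live in another namespace, it simply has to join VPP there; a sketch, assuming a netns called `dataplane`:

```
root@vpp0-2:~# ip netns exec dataplane psampletest
```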
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2` packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the hdrlen set to 70 there? I'm going to want to ask Neil about that.

***3. ingress and egress***: The `in=1` (and, for one packet, `in=2`) values represent the input `hw_if_index`, which is the interface index that VPP assigns to its devices. And looking at `show interfaces`, indeed number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks out:

```
root@vpp0-2:~# vppctl show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
GigabitEthernet10/0/0             1      up          9000/0/0/0     rx packets             469552764
                                                                    rx bytes           4218754400233
                                                                    tx packets             133717230
                                                                    tx bytes              8887341013
                                                                    drops                       6050
                                                                    ip4                    469321635
                                                                    ip6                       225164
GigabitEthernet10/0/1             2      up          9000/0/0/0     rx packets             133527636
                                                                    rx bytes              8816920909
                                                                    tx packets             469353481
                                                                    tx bytes           4218736200819
                                                                    drops                       6060
                                                                    ip4                    133489925
                                                                    ip6                        29139
```
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP dataplane. Notably, VPP's numbering of interface indexes is strictly _orthogonal_ to Linux, and it's not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0` interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default network namespace. Similarly, `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0` in Linux:

```
root@vpp0-2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
```

***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will normally use `/proc/net/` entries for that, but there are two problems with this:

1. There may not exist a Linux representation of the interface, for example if it's only doing L2 bridging or cross connects in the VPP dataplane and it does not have a Linux Control Plane interface, or `linux-cp` is not used at all.

1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the Linux representation will only count packets and octets of _punted_ packets, that is to say, the stuff that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device. It is important to note that east-west traffic that goes _through_ the dataplane is never punted to Linux, and as such the counters will undershoot: only counting traffic _to_ the router, not _through_ the router.
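
In other words, the counters really want to come from the dataplane itself rather than from `/proc/net/`. One way I can imagine doing that is via VPP's stats segment, for example with the `vpp_get_stats` utility that ships with VPP. A sketch from memory, so treat the exact paths as an assumption:

```
pim@vpp0-2:~$ vpp_get_stats ls | grep '^/if/'
pim@vpp0-2:~$ vpp_get_stats dump /if/names /if/rx /if/tx
```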
### VPP sFlow: Performance

Now that I've shown that Neil's proof of concept works, I will take a better look at the performance of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So, I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual-port Intel 82599), while the second (identical) machine will run VPP, also with 8x 10Gbps ports (2x Intel X710-DA4).
I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing), and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure, and I'll start with the L2 configuration, as follows:

```
vpp# set int state TenGigabitEthernet3/0/2 up
vpp# set int state TenGigabitEthernet3/0/3 up
vpp# set int state TenGigabitEthernet130/0/2 up
vpp# set int state TenGigabitEthernet130/0/3 up
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```
Then, the L3 configuration looks like this:

```
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
vpp# set int state TenGigabitEthernet3/0/0 up
vpp# set int state TenGigabitEthernet3/0/1 up
vpp# set int state TenGigabitEthernet130/0/0 up
vpp# set int state TenGigabitEthernet130/0/1 up
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

And finally, the Cisco T-Rex configuration looks like this:
```
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac:  00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc
    - src_mac:  00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd

    - src_mac:  00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01
    - src_mac:  00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00

    - src_mac:  00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0
    - src_mac:  00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1

    - src_mac:  9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01
    - src_mac:  9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00
```

A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode, because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac: 9c:69:b4:61:a1:dc` which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.

With that said, I can start a baseline loadtest like so:
{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}

T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy to do the work.

I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:

```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
```
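
Quick napkin math on what to expect from this configuration:

```
10 Gbps at 1514 byte frames : 10e9 / ((1514 + 20) * 8) ~= 813 Kpps per port
4 sFlow-enabled ports       : 4 * 813'000 * 1/1000     ~= 3'250 samples/sec
```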
This should yield about 3'250 or so samples per second, and things look pretty great:

```
pim@hvn6-lab:~$ vppctl show err
   Count               Node                        Reason              Severity
 5034508              sflow         sflow packets processed               error
    4908              sflow           sflow packets sampled               error
 5034508              sflow         sflow packets processed               error
    5111              sflow           sflow packets sampled               error
 5034516          l2-output               L2 output packets               error
 5034516           l2-input                L2 input packets               error
 5034404              sflow         sflow packets processed               error
    4948              sflow           sflow packets sampled               error
 5034404          l2-output               L2 output packets               error
 5034404           l2-input                L2 input packets               error
 5034404              sflow         sflow packets processed               error
    4928              sflow           sflow packets sampled               error
 5034404          l2-output               L2 output packets               error
 5034404           l2-input                L2 input packets               error
 5034516          l2-output               L2 output packets               error
 5034516           l2-input                L2 input packets               error
```

I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed` which checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
```
pim@hvn6-lab:~$ sudo psampletest
...
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
```

I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
```
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
Name                 State         Calls       Vectors    Suspends     Clocks    Vectors/Call
sflow                active      3908218      14350684           0     9.05e1            3.67
sflow                active      3913266      14350680           0     1.11e2            3.67
sflow                active      3910828      14350687           0     1.08e2            3.67
sflow                active      3909274      14350692           0     5.66e1            3.67
```

Alright, so for the 999 packets that went through and the one packet that got sampled, on average VPP is spending between 57 and 111 CPU cycles per packet (depending on the worker), and the loadtest looks squeaky clean on T-Rex.
### VPP sFlow: Cost of passthru

I decide to take a look at two edge cases. What if there are no samples being taken at all, and the `sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use only four ports, in other words a unidirectional loadtest, and I'll make it go much faster by sending smaller packets, say 128 bytes:

```
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128

pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable

pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
```
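
Napkin math on the offered load, for my own sanity:

```
128 byte frames on the wire : 128 + 20 (preamble + IFG) = 148 bytes = 1'184 bits
10 Gbps / 1'184 bits        ~= 8.45 Mpps; at 99% that is ~8.37 Mpps per port
4 ports                     ~= 33.5 Mpps offered in total
```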
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the `sFlow` plugin is not sampling many packets:

```
pim@hvn6-lab:~$ vppctl show err
   Count               Node                        Reason              Severity
59777084              sflow         sflow packets processed               error
       6              sflow           sflow packets sampled               error
59777152          l2-output               L2 output packets               error
59777152           l2-input                L2 input packets               error
59777104              sflow         sflow packets processed               error
       6              sflow           sflow packets sampled               error
59777104          l2-output               L2 output packets               error
59777104           l2-input                L2 input packets               error

pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
Name                 State         Calls       Vectors    Suspends     Clocks    Vectors/Call
sflow                active      8186642     369674664           0     1.35e1           45.16
sflow                active     25173660     369674696           0     1.97e1           14.68
```

Two observations:
1. One of these is busier than the other. Without looking further, I can already predict that the top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch up. The L2 path, on the other hand, is quicker and therefore will have fewer packets waiting on subsequent iterations of `dpdk-input`.

2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into `ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over 15 or so packets each time it runs.

I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling works just fine:
{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}

### VPP sFlow: Cost of sampling

The other interesting case is to figure out how much CPU it takes to execute the code path with the actual sampling. This one turns out to be a bit trickier to measure. While leaving the previous loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of 1:10 packets:

```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
```

The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from 33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
```
pim@hvn6-lab:~$ vppctl show err | grep sflow
340502528              sflow         sflow packets processed               error
 12254462              sflow           sflow packets dropped               error
 22611461              sflow           sflow packets sampled               error
422527140              sflow         sflow packets processed               error
  8533855              sflow           sflow packets dropped               error
 34235952              sflow           sflow packets sampled               error
```

Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker `sflow packets dropped`, and it's roughly *half* of the samples.

My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning only 665Kpps per port, which is horrible, and it's still dropping samples.

My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from the offered 1.5Mpps. VPP is clearly not having a good time here.

Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC port, without sFlow), and VPP forwards all 16Mpps and is happy again.

Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
#### Reasoning about the bottleneck

But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer I turn off all ports again, and ramp up the one port pair with L3 + sFlow at a 1:10 ratio to full line rate: that is 64 byte packets at 14.88Mpps:

```
tui>stop
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
```
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me some reasonable data to try to feel out where the bottleneck is.

```
Thread 2 vpp_wk_1 (lcore 3)
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
  vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls      Vectors    Suspends     Clocks    Vectors/Call
TenGigabitEthernet130/0/1-outp   active       77906     19943936           0     5.79e0          256.00
TenGigabitEthernet130/0/1-tx     active       77906     19943936           0     6.88e1          256.00
dpdk-input                      polling       77906     19943936           0     4.41e1          256.00
ethernet-input                   active       77906     19943936           0     2.21e1          256.00
ip4-input                        active       77906     19943936           0     2.05e1          256.00
ip4-load-balance                 active       77906     19943936           0     1.07e1          256.00
ip4-lookup                       active       77906     19943936           0     1.98e1          256.00
ip4-rewrite                      active       77906     19943936           0     1.97e1          256.00
sflow                            active       77906     19943936           0     6.14e1          256.00

pim@hvn6-lab:pim# vppctl show err | grep sflow
551357440              sflow         sflow packets processed               error
 19829380              sflow           sflow packets dropped               error
 36613544              sflow           sflow packets sampled               error
```

OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three code paths, each one extending the other:
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass through a packet.
1. Very cheap: select the sample and construct the RPC, but toss it, costing Y CPU cycles.
1. Expensive: select the sample, and send the RPC. Z CPU cycles in the worker, and another amount in main.

Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll guesstimate that it takes Y=20 CPU cycles.

With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:

```
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total

61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
33853346816 = 7443325440 + 732270880 + 16784164 * Z
25677750496 = 16784164 * Z
Z = 1529
```

Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a dataplane that is trying to do 10Mpps per core, with a core running at 2.2GHz, there are really only 220 CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets feels dangerous to me.
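
To put that into perspective, my own napkin math using the Z~1530 estimate from above:

```
Budget at 10 Mpps on a 2.2 GHz core : 2'200'000'000 / 10'000'000 = 220 cycles/packet
1:10 sampling                       : ~1530 / 10   ~= 153 extra cycles/packet on average
1:1000 sampling                     : ~1530 / 1000 ~= 1.5 extra cycles/packet on average
```

So at sane sampling rates the per-sample cost itself looks affordable; it's the pathological 1:10 case, and whatever contention comes with it, that hurts.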
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see 273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`, which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the packets back out (same number: 3.1607e6).

So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:

```
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
    sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
    rx missed                                      4065539464
    rx missed                                      4182788310
```

In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it forwarded 3.16Mpps and you know what? 11.72 + 3.16 is precisely 14.88Mpps. All packets are accounted for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were delivered:

```
Thread 2 vpp_wk_1 (lcore 3)
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
  vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls      Vectors    Suspends     Clocks    Vectors/Call
TenGigabitEthernet130/0/1-outp   active      620012    158723072           0     5.66e0          256.00
TenGigabitEthernet130/0/1-tx     active      620012    158723072           0     7.01e1          256.00
dpdk-input                      polling      620012    158723072           0     4.39e1          256.00
ethernet-input                   active      620012    158723072           0     1.56e1          256.00
ip4-input-no-checksum            active      620012    158723072           0     1.43e1          256.00
ip4-load-balance                 active      620012    158723072           0     1.11e1          256.00
ip4-lookup                       active      620012    158723072           0     2.00e1          256.00
ip4-rewrite                      active      620012    158723072           0     2.02e1          256.00
```

Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good _enough_!"
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate loadtest went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker threads are also impacted. I spent some time thinking about this and poking around, but I did not find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here's a screenshot of VPP on the struggle bus:

{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}

**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet sweet packets that the network card had dutifully received for it, resulting in the `rx-miss` situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this accounts for only 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU cycles unaccounted for in the dataplane. I still don't understand why _other_ worker threads are impacted, though.
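
Here's the napkin version of that accounting:

```
273 cycles/packet * 3'160'000 packets/sec ~=   862 M cycles/sec accounted for
2'200 M cycles/sec - 862 M cycles/sec     ~= 1'338 M cycles/sec (~60%) spent elsewhere,
                                              presumably spinning on that RPC lock
```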
## What's Next

I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie the VPP plugin into the `hsflowd` host sFlow daemon in a way that the interface indexes, counters and packet lengths are all correct. Of course, the main improvement that we can make is to allow for the system to work better under load, which will take some thinking.

I should do a few more tests with a debug binary and profiling turned on. I quickly ran `perf` over the VPP (release / optimized) binary running on the bench, but it merely said that with sFlow enabled, about 80% of the time is spent in `libvlib`, whereas the baseline (sFlow turned off) spends a much larger share of its time in `libvnet`:
```
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
# Overhead  Shared Object (sFlow)        Overhead  Shared Object (baseline)
# ........  ......................       ........  ........................
#
    79.02%  libvlib.so.24.10               54.27%  libvlib.so.24.10
    12.82%  libvnet.so.24.10               33.91%  libvnet.so.24.10
     3.77%  dpdk_plugin.so                 10.87%  dpdk_plugin.so
     3.21%  [kernel.kallsyms]               0.81%  [kernel.kallsyms]
     0.29%  sflow_plugin.so                 0.09%  ld-linux-x86-64.so.2
     0.28%  libvppinfra.so.24.10            0.03%  libc.so.6
     0.21%  libc.so.6                       0.01%  libvppinfra.so.24.10
     0.17%  libvlibapi.so.24.10             0.00%  libvlibmemory.so.24.10
     0.15%  libvlibmemory.so.24.10
     0.07%  ld-linux-x86-64.so.2
     0.00%  vpp
     0.00%  [vdso]
     0.00%  libsvm.so.24.10
```

Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of interesting work to do on this `sflow` plugin, with matching ifIndexes for consumers like `hsflowd`, reading interface counters from the dataplane (or from the Prometheus Exporter), and most importantly, ensuring it works well, or fails gracefully, under stringent load.
From the _cray-cray_ ideas department, what if we:

1. In the worker thread, produce the sample, but instead of sending an RPC to main and taking the lock, append it to a per-worker sample queue and move on. This way, no locks are needed, and each worker thread will have its own producer queue (a rough sketch of such a queue follows below).

1. Create a separate worker (or even a pool of workers), possibly running on a different CPU (or in main), that loops over all sFlow sample queues, consumes the samples, and sends them in batches to the PSAMPLE Netlink group, possibly dropping samples if too many come in.
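
For the first idea, a single-producer/single-consumer ring per worker would do the trick. The following is a rough sketch from me to make the idea concrete, not anything that exists in the plugin today; `sflow_sample_t` is the same illustrative struct from the anatomy section above:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint32_t hw_if_index; uint8_t header[128]; } sflow_sample_t; /* see earlier sketch */

#define RING_SIZE 1024                  /* power of two, so indexes wrap with a mask */
#define RING_MASK (RING_SIZE - 1)

typedef struct {
  _Atomic uint32_t head;                /* only written by the worker (producer) */
  _Atomic uint32_t tail;                /* only written by the consumer thread   */
  sflow_sample_t samples[RING_SIZE];
} sflow_ring_t;

/* Worker thread: enqueue or drop, never block, never take a lock */
static int
ring_push (sflow_ring_t *r, const sflow_sample_t *s)
{
  uint32_t head = atomic_load_explicit (&r->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit (&r->tail, memory_order_acquire);
  if (head - tail >= RING_SIZE)
    return 0;                           /* full: drop the sample, bump an error counter */
  r->samples[head & RING_MASK] = *s;
  atomic_store_explicit (&r->head, head + 1, memory_order_release);
  return 1;
}

/* Consumer thread: drain in batches and write PSAMPLE netlink messages */
static int
ring_pop (sflow_ring_t *r, sflow_sample_t *out)
{
  uint32_t tail = atomic_load_explicit (&r->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit (&r->head, memory_order_acquire);
  if (tail == head)
    return 0;                           /* empty */
  *out = r->samples[tail & RING_MASK];
  atomic_store_explicit (&r->tail, tail + 1, memory_order_release);
  return 1;
}
```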
I'm reminded that this pattern exists already -- async crypto workers create a `crypto-dispatch` node that acts as a poller for inbound crypto, and it hands off the result back into the worker thread: lockless at the expense of some complexity!

## Acknowledgements

The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin, and to Peter Phaal for facilitating a get-together last year.

Who's up for making this thing a reality?!
|