---
date: "2024-09-08T12:51:23Z"
title: 'VPP with sFlow - Part 1'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}

In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
Ciprian reached out to me after seeing my [[DENOG
#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested in learning
about IPFIX and asked if sFlow would be an option. At the time, there was a plugin in VPP called
[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)]
which is able to emit IPFIX records. Unfortunately, I never really got it to work well in my tests,
as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
a few fixes to flowprobe, but it remains an expensive operation computationally. Wouldn't copying
one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?

In the months that followed, I discussed the feature with the incredible folks at
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)],
but for lack of time I didn't manage to get it to work.

However, I have a bit of time on my hands in September and October, and just a few weeks ago,
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!

## sFlow: Protocol

Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.

sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
forwarding information (such as router source- and destination prefix, MPLS LSP
information, BGP communities, and what-not). Finally, the Agent will periodically read the octet
and packet counters of the physical network interface. Ultimately, the Agent will send the samples
and additional information to an **sFlow Collector** for further processing.

sFlow has been specifically designed to take advantage of the statistical properties of packet
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
monitoring system will always produce statistically quantifiable measurements. You can read more
about it in Peter Phaal and Sonia Panchen's
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did :)

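The practical takeaway from that paper, if I'm reading it right, is that the accuracy depends only
on the number of samples collected, not on the sampling rate or the link speed. The error bound they
derive is simple enough to eyeball (the arithmetic below is mine, so double check before relying on it):

```
pct_error <= 196 * sqrt(1 / c)      # c = number of samples for a traffic class
196 * sqrt(1 / 1000)  = ~6.2%       # after 1'000 samples
196 * sqrt(1 / 10000) = ~2.0%       # after 10'000 samples
```

This is why even 1:10000 sampling on a busy link is perfectly usable: the samples accumulate
quickly, and the error bound shrinks with them.
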
### sFlow: Netlink PSAMPLE

sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
done in hardware, but there also exist several software implementations. One very clever thing, I
think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
samples of a certain _group_. The PSAMPLE API uses
[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.

The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
from the network interfaces, and generating Netlink messages. Then, some other program, notably
outside of VPP, can subscribe to these messages and further handle them, creating UDP packets with
sFlow samples and counters and other information, and sending them to an **sFlow Collector**
somewhere else on the network.

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}

There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
this stuff, I kept on getting errors. It turns out, there's a kernel module that needs to be loaded:
run `modprobe psample` and make sure it's in `/etc/modules`, before you spend as many hours as I did
pulling out hair.

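In other words, the two commands that would have saved me an afternoon:

```
root@vpp0-2:~# modprobe psample
root@vpp0-2:~# echo 'psample' >> /etc/modules
```
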
## VPP: sFlow Plugin

For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
performance.

### sFlow Plugin: Anatomy

The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
`device-input`. In other words, if enabled, the plugin will get a copy of all packets that are read
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
`ethernet-input`.

The kicker is that one in N packets will be selected to sample, after which:
1. the ethernet header (`*en`) is extracted from the packet
1. the input interface (`hw_if_index`) is extracted from the VPP buffer
1. if there are too many samples from this worker thread being worked on, the sample is discarded
   and an error counter is incremented. This protects the main thread from being slammed with
   samples if there are simply too many being fished out of the dataplane.
1. Otherwise:
   * a new `sflow_sample_t` is created, with all the sampling process metadata filled in
   * the first 128 bytes of the packet are copied into the sample
   * an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel

Both a debug CLI command and API call are added:

```
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
```

Some observations:

First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
100G network cards available. This is not going to be an issue to retrofit.

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
samples are sent from many threads, there will be lock contention and performance will nose dive.

### sFlow Plugin: Functional

I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it,
making sure the `psample` kernel module is loaded. In this first test, I'll take a look at
table stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
of the four VPP routers. For reference, the Lab looks like this:

{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}

What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:

```
pim@vpp0-3:~ $ iperf3 -s -D
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
```

### Configuring VPP for sFlow

While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do
is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
sampling rate of 1:1000 as these interfaces are GigabitEthernet:

```
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
sflow                            active            5656        24168           0   9.01e2          4.27
```

Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input`, where it can do its
business.

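To double check where the node landed in the graph, VPP's node introspection comes in handy; I'm
quoting the command from memory, so the exact spelling and output may differ by version:

```
root@vpp0-2:~# vppctl show node sflow
```
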
But is it sending data? To answer this question, I can first take a look at the `psampletest` tool:

```
root@vpp0-2:~# psampletest
pstest: modprobe psample returned 0
pstest: netlink socket number = 1637
pstest: getFamily
pstest: generic netlink CMD = 1
pstest: generic family name: psample
pstest: generic family id: 32
pstest: psample attr type: 4 (nested=0) len: 8
pstest: psample attr type: 5 (nested=0) len: 8
pstest: psample attr type: 6 (nested=0) len: 24
pstest: psample multicast group id: 9
pstest: psample multicast group: config
pstest: psample multicast group id: 10
pstest: psample multicast group: packets
pstest: psample found group packets=10
pstest: joinGroup 10
pstest: received Netlink ACK
pstest: joinGroup 10
pstest: set headers...
pstest: serialize...
pstest: print before sending...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
pstest: send...
pstest: send_psample getuid=0 geteuid=0
pstest: sendmsg returned 140
pstest: free...
pstest: start read loop...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
```

I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push
100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can
expect two or three samples per second. I immediately notice a few things:

***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in
Linux CP, oftentimes the controlplane interfaces are created in a dedicated `dataplane` network
namespace.

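A hedged example: had VPP been started in that hypothetical `dataplane` netns, I would have to go
chase the samples there, too:

```
root@vpp0-2:~# ip netns exec dataplane psampletest
```
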
***2. pktlen and hdrlen***: The pktlen is off, and this is a bug. In VPP, packets are put into buffers
of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the
same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2` packet
with length 66, that looks like a legitimate ACK packet on the way back. But why is the hdrlen set
to 70 there? I'm going to want to ask Neil about that.

***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`,
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks
out:
```
root@vpp0-2:~# vppctl show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
GigabitEthernet10/0/0             1      up          9000/0/0/0     rx packets           469552764
                                                                    rx bytes         4218754400233
                                                                    tx packets           133717230
                                                                    tx bytes            8887341013
                                                                    drops                     6050
                                                                    ip4                  469321635
                                                                    ip6                     225164
GigabitEthernet10/0/1             2      up          9000/0/0/0     rx packets           133527636
                                                                    rx bytes            8816920909
                                                                    tx packets           469353481
                                                                    tx bytes         4218736200819
                                                                    drops                     6060
                                                                    ip4                  133489925
                                                                    ip6                      29139
```

***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
not even guaranteed that there _exists_ an interface in Linux for the PHY upon which the
sampling is happening. Said differently, `in=1` here is meant to reference VPP's
`GigabitEthernet10/0/0` interface, but in Linux, `ifIndex=1` is a completely different interface
(`lo`) in the default network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds
to interface `enp1s0` in Linux:

```
root@vpp0-2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
```

***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
normally use `/proc/net/` entries for that, but there are two problems with this:
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
   bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
   interface.
1. Even if it does exist (and it's the "correct" ifIndex in Linux), the statistics counters in the
   Linux representation will only count packets and octets of _punted_ packets, that is to say,
   the stuff that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device.
   It's important to note that east-west traffic that goes _through_ the dataplane is never punted
   to Linux, and as such, the counters will be undershooting: only counting traffic _to_ the router,
   not _through_ the router (see the quick comparison sketched below).

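That second mismatch is easy to demonstrate side by side: the dataplane counters count everything
the interface forwards, while the kernel counters only ever see the punted traffic. A sketch, using
the same Lab interfaces as above:

```
root@vpp0-2:~# vppctl show interface GigabitEthernet10/0/1
root@vpp0-2:~# ip -s link show enp1s0
```
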
### VPP sFlow: Performance

Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
thread to marshal the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual-port Intel 82599), while the second
(identical) machine will run VPP, also with 8x 10Gbps ports (2x Intel X710-DA4).

I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure,
and I'll start with the L2 configuration, as follows:

```
vpp# set int state TenGigabitEthernet3/0/2 up
vpp# set int state TenGigabitEthernet3/0/3 up
vpp# set int state TenGigabitEthernet130/0/2 up
vpp# set int state TenGigabitEthernet130/0/3 up
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```

Then, the L3 configuration looks like this:

```
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
vpp# set int state TenGigabitEthernet3/0/0 up
vpp# set int state TenGigabitEthernet3/0/1 up
vpp# set int state TenGigabitEthernet130/0/0 up
vpp# set int state TenGigabitEthernet130/0/1 up
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

And finally, the Cisco T-Rex configuration looks like this:

```
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac: 00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc
    - src_mac: 00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd

    - src_mac: 00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01
    - src_mac: 00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00

    - src_mac: 00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0
    - src_mac: 00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1

    - src_mac: 9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01
    - src_mac: 9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00
```

A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
9c:69:b4:61:a1:dc`, which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.

With that said, I can start a baseline loadtest like so:

{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}

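For reference, the T-Rex invocation behind that screenshot is roughly the following (`stl/ipng.py`
is my own traffic profile, and I'm quoting the flags from memory, so treat this as a sketch; the
loadtests later in this article show the real invocations):

```
tui>start -f stl/ipng.py -m 100% -t size=1514
```
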
T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or
a cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy
to do the work.

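As a sanity check on those numbers, counting the 24 bytes of preamble, FCS and inter-frame gap that
each frame also occupies on the wire:

```
10e9 / ((1514 + 24) * 8) = ~813'000 packets/sec per 10G port
       8 * 813'000       = ~6.51M packets/sec total
```
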
I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:

```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
```

This should yield about 3'250 or so samples per second, and things look pretty great:

```
pim@hvn6-lab:~$ vppctl show err
   Count                  Node                           Reason                   Severity
 5034508                 sflow                  sflow packets processed            error
    4908                 sflow                   sflow packets sampled             error
 5034508                 sflow                  sflow packets processed            error
    5111                 sflow                   sflow packets sampled             error
 5034516               l2-output                   L2 output packets               error
 5034516               l2-input                     L2 input packets               error
 5034404                 sflow                  sflow packets processed            error
    4948                 sflow                   sflow packets sampled             error
 5034404               l2-output                   L2 output packets               error
 5034404               l2-input                     L2 input packets               error
 5034404                 sflow                  sflow packets processed            error
    4928                 sflow                   sflow packets sampled             error
 5034404               l2-output                   L2 output packets               error
 5034404               l2-input                     L2 input packets               error
 5034516               l2-output                   L2 output packets               error
 5034516               l2-input                     L2 input packets               error
```

I can see that `sflow packets sampled` is roughly 0.1% of `sflow packets processed`, which
checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:

```
pim@hvn6-lab:~$ sudo psampletest
...
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
```

and yes, all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes correctly
correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this machine,
each TenGig network interface has its own dedicated VPP worker thread. Considering I turned on
sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:

```
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
sflow                            active          3908218     14350684           0   9.05e1          3.67
sflow                            active          3913266     14350680           0   1.11e2          3.67
sflow                            active          3910828     14350687           0   1.08e2          3.67
sflow                            active          3909274     14350692           0   5.66e1          3.67
```

Alright, so for the 999 packets that pass through and the one packet that gets sampled, on average
VPP is spending between 57 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
T-Rex.

### VPP sFlow: Cost of passthru

I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
sending smaller packets, say 128 bytes:

```
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128

pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable

pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
```

The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
`sflow` plugin is not sampling many packets:

```
pim@hvn6-lab:~$ vppctl show err
   Count                  Node                           Reason                   Severity
 59777084                sflow                  sflow packets processed            error
        6                sflow                   sflow packets sampled             error
 59777152              l2-output                   L2 output packets               error
 59777152              l2-input                     L2 input packets               error
 59777104                sflow                  sflow packets processed            error
        6                sflow                   sflow packets sampled             error
 59777104              l2-output                   L2 output packets               error
 59777104              l2-input                     L2 input packets               error

pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
sflow                            active          8186642    369674664           0   1.35e1         45.16
sflow                            active         25173660    369674696           0   1.97e1         14.68
```

Two observations:

1. One of these is busier than the other. Without looking further, I can already predict that the
   top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
   dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
   time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
   because of that, it'll end up consuming more packets on each subsequent call, in order to catch
   up. The L2 path on the other hand, is quicker and therefore will have fewer packets waiting on
   subsequent calls to `dpdk-input`.
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
   `ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little
   bit more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU
   instruction cache over 45 packets each time it runs, while the L2 path can only amortize its
   instruction cache over 15 or so packets each time it runs.

I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
works just fine:

{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}

### VPP sFlow: Cost of sampling

The other interesting case is to figure out how much CPU it takes to execute the code path
with the actual sampling. This one turns out a bit trickier to measure. While leaving the previous
loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
1:10 packets:

```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
```

The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:

```
pim@hvn6-lab:~$ vppctl show err | grep sflow
 340502528               sflow                  sflow packets processed            error
  12254462               sflow                   sflow packets dropped             error
  22611461               sflow                   sflow packets sampled             error
 422527140               sflow                  sflow packets processed            error
   8533855               sflow                   sflow packets dropped             error
  34235952               sflow                   sflow packets sampled             error
```

Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a
safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
`sflow packets dropped`, and it accounts for between a quarter and half of the samples, depending on
the thread.

My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under
the current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning 665Kpps per
port only, which is horrible, and it's still dropping samples.

My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps
from the offered 1.5Mpps. VPP is clearly not having a good time here.

Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
port, without sFlow), VPP forwards all 16Mpps and is happy again.

Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!

#### Reasoning about the bottleneck

But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer I
turn off all ports again, and ramp up the one port pair with (L3 + sFlow at 1:10 ratio) to full line
rate: that is 64 byte packets at 14.88Mpps:

```
tui>stop
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
```

VPP is now on the struggle bus and is returning 3.17Mpps or 21% of that. But, I think it'll give me
some reasonable data to try to feel out where the bottleneck is.

```
Thread 2 vpp_wk_1 (lcore 3)
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
  vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TenGigabitEthernet130/0/1-outp   active            77906     19943936           0   5.79e0        256.00
TenGigabitEthernet130/0/1-tx     active            77906     19943936           0   6.88e1        256.00
dpdk-input                       polling           77906     19943936           0   4.41e1        256.00
ethernet-input                   active            77906     19943936           0   2.21e1        256.00
ip4-input                        active            77906     19943936           0   2.05e1        256.00
ip4-load-balance                 active            77906     19943936           0   1.07e1        256.00
ip4-lookup                       active            77906     19943936           0   1.98e1        256.00
ip4-rewrite                      active            77906     19943936           0   1.97e1        256.00
sflow                            active            77906     19943936           0   6.14e1        256.00

pim@hvn6-lab:pim# vppctl show err | grep sflow
 551357440               sflow                  sflow packets processed            error
  19829380               sflow                   sflow packets dropped             error
  36613544               sflow                   sflow packets sampled             error
```

OK, the `sflow` plugin saw 551M packets and selected 36.6M of them for sampling, but it dropped
19.8M of those, ultimately sending RPCs to the main thread for only 16.8M samples. There are three
code paths:

1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
   through a packet.
1. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles.
1. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main.

Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
guesstimate that it takes Y=20 CPU cycles.

With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:

```
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total

       61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
       61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
33853346816 = 7443325440 + 732270880 + 16784164 * Z
25677750496 = 16784164 * Z
          Z = 1529
```

Good to know! I find spending 1500 cycles to send the sample pretty reasonable, but for a dataplane
that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220 CPU
cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
feels dangerous to me.

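Amortizing Z over the sampling interval shows the same thing from a different angle, and also
explains why the earlier 1:1000 loadtest was barely measurable:

```
1529 / 10   = ~153 extra cycles per packet at 1:10 sampling
1529 / 1000 = ~1.5 extra cycles per packet at 1:1000 sampling
```
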
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
packets back out (same number: 3.1607e6).

So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:

```
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
                sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
    rx missed                                   4065539464
    rx missed                                   4182788310
```

In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. But I already know that it
performed 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
for! It's just that DPDK never managed to read them from the hardware.

As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
delivered:

```
Thread 2 vpp_wk_1 (lcore 3)
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
  vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls       Vectors     Suspends   Clocks    Vectors/Call
TenGigabitEthernet130/0/1-outp   active           620012    158723072           0   5.66e0        256.00
TenGigabitEthernet130/0/1-tx     active           620012    158723072           0   7.01e1        256.00
dpdk-input                       polling          620012    158723072           0   4.39e1        256.00
ethernet-input                   active           620012    158723072           0   1.56e1        256.00
ip4-input-no-checksum            active           620012    158723072           0   1.43e1        256.00
ip4-load-balance                 active           620012    158723072           0   1.11e1        256.00
ip4-lookup                       active           620012    158723072           0   2.00e1        256.00
ip4-rewrite                      active           620012    158723072           0   2.02e1        256.00
```

Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
_enough_!"

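Summing the `Clocks` column in that last table confirms the arithmetic:

```
5.66 + 70.1 + 43.9 + 15.6 + 14.3 + 11.1 + 20.0 + 20.2 = ~201 clocks/packet
2.2e9 / 201 = ~10.9Mpps
```
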
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps), and saw that
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
loadtest went up slightly to 3.44Mpps now. What puzzles me even more, is that the non-sFlow threads
are also impacted. I spent some time thinking about this and poking around, but I did not find a
good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here's a
screenshot of VPP on the struggle bus:

{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}

**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
though.
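The back-of-the-envelope behind that claim:

```
273 cycles/packet * 3.16Mpps = ~862M cycles/sec accounted for
2.2GHz worker core           = 2200M cycles/sec available
1 - 862 / 2200               = ~60% of cycles spent somewhere else, presumably on that lock
```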
## What's Next

I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
and packet lengths are all correct.

I should do a few more tests with a debug binary and profiling turned on. I quickly ran a `perf`
over the VPP (release / optimized) binary running on the bench, but it merely said 80% of time was
spent in `libvlib` rather than `libvnet` in the baseline (sFlow turned off).

```
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
# Overhead  Shared Object (sFlow)     Overhead  Shared Object (baseline)
# ........  ......................    ........  ........................
#
    79.02%  libvlib.so.24.10            54.27%  libvlib.so.24.10
    12.82%  libvnet.so.24.10            33.91%  libvnet.so.24.10
     3.77%  dpdk_plugin.so              10.87%  dpdk_plugin.so
     3.21%  [kernel.kallsyms]            0.81%  [kernel.kallsyms]
     0.29%  sflow_plugin.so              0.09%  ld-linux-x86-64.so.2
     0.28%  libvppinfra.so.24.10         0.03%  libc.so.6
     0.21%  libc.so.6                    0.01%  libvppinfra.so.24.10
     0.17%  libvlibapi.so.24.10          0.00%  libvlibmemory.so.24.10
     0.15%  libvlibmemory.so.24.10
     0.07%  ld-linux-x86-64.so.2
     0.00%  vpp
     0.00%  [vdso]
     0.00%  libsvm.so.24.10
```

Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
reading interface counters from the dataplane (or from the Prometheus Exporter), and most
importantly, ensuring it works well, or fails gracefully, under stringent load.

## Acknowledgements

The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
and Peter Phaal for facilitating a get-together last year.

Who's up for making this thing a reality?!