From 08d55e6ac05348dbf5da826c85891c8cedfe9a5f Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Mon, 9 Sep 2024 10:26:15 +0200
Subject: [PATCH] sFlow, part 1

---
 content/articles/2024-09-08-sflow-1.md | 701 +++++++++++++++++++++++++
 static/assets/sflow/sflow.gif          |   3 +
 static/assets/sflow/trex-baseline.png  |   3 +
 static/assets/sflow/trex-overload.png  |   3 +
 static/assets/sflow/trex-passthru.png  |   3 +
 5 files changed, 713 insertions(+)
 create mode 100644 content/articles/2024-09-08-sflow-1.md
 create mode 100644 static/assets/sflow/sflow.gif
 create mode 100644 static/assets/sflow/trex-baseline.png
 create mode 100644 static/assets/sflow/trex-overload.png
 create mode 100644 static/assets/sflow/trex-passthru.png

diff --git a/content/articles/2024-09-08-sflow-1.md b/content/articles/2024-09-08-sflow-1.md
new file mode 100644
index 0000000..94d8c81
--- /dev/null
+++ b/content/articles/2024-09-08-sflow-1.md
@@ -0,0 +1,701 @@
+---
+date: "2024-09-08T12:51:23Z"
+title: 'VPP with sFlow - Part 1'
+---
+
+# Introduction
+
+{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
+
+In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
+Ciprian reached out to me after seeing my [[DENOG
+#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
+IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
+[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)],
+which is able to emit IPFIX records. Unfortunately, I never really got it to work well in my tests,
+as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
+dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
+a few fixes to flowprobe, but it remains a computationally expensive operation. Wouldn't copying
+one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
+
+In the months that followed, I discussed the feature with the incredible folks at
+[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
+toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)],
+but for lack of time I didn't manage to get it to work.
+
+However, I have a bit of time on my hands in September and October, and just a few weeks ago,
+my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
+sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
+
+## sFlow: Protocol
+
+Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
+authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
+
+sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
+N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
+On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
+bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
+the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
+forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
+communities, and what-not).
+Finally, the Agent will periodically read the octet
+and packet counters of the physical network interface. Ultimately, the Agent will send the samples
+and additional information to an **sFlow Collector** for further processing.
+
+sFlow has been specifically designed to take advantage of the statistical properties of packet
+sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
+monitoring system will always produce statistically quantifiable measurements. You can read more
+about it in Peter Phaal and Sonia Panchen's
+[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did :)
+
+### sFlow: Netlink PSAMPLE
+
+sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
+done in hardware, but there also exist several software implementations. One very clever thing, I
+think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
+API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
+allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
+samples of a certain _group_. The PSAMPLE API uses
+[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
+thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
+[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
+
+The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
+from the network interfaces, and generating Netlink messages. Then, some other program, notably
+outside of VPP, can subscribe to these messages and further handle them, creating UDP packets with
+sFlow samples and counters and other information, and sending them to an **sFlow Collector**
+somewhere else on the network.
+
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
+
+There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
+subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
+this stuff, I kept on getting errors. It turns out there's a kernel module that needs to be loaded:
+`modprobe psample`, and make sure it's in `/etc/modules`, before you spend as many hours as I did
+pulling out hair.
+
+## VPP: sFlow Plugin
+
+For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
+[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
+performance.
+
+### sFlow Plugin: Anatomy
+
+The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
+plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
+`device-input`. In other words, if enabled, the plugin will get a copy of all packets that are read
+from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
+packet, and if it's not selected for sampling, just move it onwards to the next node, typically
+`ethernet-input`.
+
+The kicker is that one in N packets will be selected to sample, after which:
+1. the ethernet header (`*en`) is extracted from the packet
+1. the input interface (`hw_if_index`) is extracted from the VPP buffer
+1. if there are too many samples from this worker thread still being worked on, the sample is
+   discarded and an error counter is incremented.
+   This protects the main thread from being slammed with samples if there are simply too many
+   being fished out of the dataplane.
+1. Otherwise:
+   * a new `sflow_sample_t` is created, with all the sampling process metadata filled in
+   * the first 128 bytes of the packet are copied into the sample
+   * an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
+
+Both a debug CLI command and API call are added:
+
+```
+sflow enable-disable <interface-name> [<sampling_N>]|[disable]
+```
+
+Some observations:
+
+First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
+make sense to make this per-interface, as routers typically have a mixture of 1G/10G and faster
+100G network cards available. This is not going to be an issue to retrofit.
+
+{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
+
+Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
+requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
+samples are sent from many threads, there will be lock contention and performance will nose dive.
+
+### sFlow Plugin: Functional
+
+I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})], install a bunch of sFlow tools on it, and
+make sure the `psample` kernel module is loaded. In this first test I'll take a look at the table
+stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
+of the four VPP routers. For reference, the Lab looks like this:
+
+{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
+
+What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
+a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
+
+```
+pim@vpp0-3:~ $ iperf3 -s -D
+pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
+```
+
+### Configuring VPP for sFlow
+
+While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do
+is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
+the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
+sampling rate of 1:1000 as these interfaces are GigabitEthernet:
+
+```
+root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
+root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
+root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
+ Name          State       Calls   Vectors   Suspends   Clocks   Vectors/Call
+sflow          active       5656     24168          0   9.01e2           4.27
+```
+
+Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
+business. But is it sending data? To answer this question, I can first take a look at the
+`psampletest` tool:
+
+```
+root@vpp0-2:~# psampletest
+pstest: modprobe psample returned 0
+pstest: netlink socket number = 1637
+pstest: getFamily
+pstest: generic netlink CMD = 1
+pstest: generic family name: psample
+pstest: generic family id: 32
+pstest: psample attr type: 4 (nested=0) len: 8
+pstest: psample attr type: 5 (nested=0) len: 8
+pstest: psample attr type: 6 (nested=0) len: 24
+pstest: psample multicast group id: 9
+pstest: psample multicast group: config
+pstest: psample multicast group id: 10
+pstest: psample multicast group: packets
+pstest: psample found group packets=10
+pstest: joinGroup 10
+pstest: received Netlink ACK
+pstest: joinGroup 10
+pstest: set headers...
+pstest: serialize...
+pstest: print before sending...
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
+pstest: send...
+pstest: send_psample getuid=0 geteuid=0
+pstest: sendmsg returned 140
+pstest: free...
+pstest: start read loop...
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
+```
+
+I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push
+100Mbit using 9000 byte jumboframes (which works out to something like 1400 packets/second), I can
+expect two or three samples per second. I immediately notice a few things:
+
+***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
+process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
+Thus, `psampletest` and other tools must also run in that namespace. I mention this because in
+Linux CP, oftentimes the controlplane interfaces are created in a dedicated `dataplane` network
+namespace.
+
+***2. pktlen and hdrlen***: The pktlen is off, and this is a bug. In VPP, packets are put into buffers
+of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the
+same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2` packet
+with length 66, that looks like a legitimate ACK packet on the way back. But why is the hdrlen set
+to 70 there? I'm going to want to ask Neil about that.
+
+***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`,
+which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
+number 1 corresponds with `GigabitEthernet10/0/0` and 2 with `GigabitEthernet10/0/1`, which checks
+out:
+```
+root@vpp0-2:~# vppctl show int
+              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
+GigabitEthernet10/0/0             1      up          9000/0/0/0     rx packets             469552764
+                                                                    rx bytes           4218754400233
+                                                                    tx packets             133717230
+                                                                    tx bytes              8887341013
+                                                                    drops                       6050
+                                                                    ip4                    469321635
+                                                                    ip6                       225164
+GigabitEthernet10/0/1             2      up          9000/0/0/0     rx packets             133527636
+                                                                    rx bytes              8816920909
+                                                                    tx packets             469353481
+                                                                    tx bytes           4218736200819
+                                                                    drops                       6060
+                                                                    ip4                    133489925
+                                                                    ip6                        29139
+
+```
+
+***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
+dataplane. Notably, VPP's numbering of interface indexes is strictly _orthogonal_ to Linux, and it's
+not even guaranteed that there _exists_ an interface in Linux for the PHY upon which the
+sampling is happening. Said differently, `in=1` here is meant to reference VPP's
+`GigabitEthernet10/0/0` interface, but in Linux, `ifIndex=1` is a completely different interface
+(`lo`) in the default network namespace.
+Similarly, `in=2` for VPP's `Gi10/0/1` interface corresponds
+to interface `enp1s0` in Linux:
+
+```
+root@vpp0-2:~# ip link
+1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
+    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
+2: enp1s0: mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+    link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
+```
+
+***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
+normally use `/proc/net/` entries for that, but there are two problems with this:
+1. There may not exist a Linux representation of the interface, for example if it's only doing L2
+   bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
+   interface.
+1. Even if it does exist (and it's the "correct" ifIndex in Linux), the statistics counters in the
+   Linux representation will only count packets and octets of _punted_ packets, that is to say,
+   the stuff that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device.
+   It's important to note that east-west traffic that goes _through_ the dataplane is never punted
+   to Linux, and as such, the counters will be undershooting: only counting traffic _to_ the
+   router, not _through_ the router.
+
+### VPP sFlow: Performance
+
+Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
+of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
+thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
+I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
+Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual Intel 58299), while the second (identical)
+machine will run VPP, also with 8x 10Gbps ports (2x Intel i710-DA4).
+
+I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
+and secondly I'll test that with and without sFlow turned on.
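+
+Since I'll be referring to these port pairs by number later on, here's a summary of the
+configuration that follows (pair numbering follows the T-Rex port order):
+
+| Port pair | T-Rex ports | VPP interfaces        | Mode          | sFlow               |
+|-----------|-------------|-----------------------|---------------|---------------------|
+| 0         | 0, 1        | Te3/0/0 - Te3/0/1     | L3 (IPv4)     | off                 |
+| 1         | 2, 3        | Te3/0/2 - Te3/0/3     | L2 (xconnect) | off                 |
+| 2         | 4, 5        | Te130/0/0 - Te130/0/1 | L3 (IPv4)     | on (in later tests) |
+| 3         | 6, 7        | Te130/0/2 - Te130/0/3 | L2 (xconnect) | on (in later tests) |
+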
+This gives me 8 ports to configure,
+and I'll start with the L2 configuration, as follows:
+
+```
+vpp# set int state TenGigabitEthernet3/0/2 up
+vpp# set int state TenGigabitEthernet3/0/3 up
+vpp# set int state TenGigabitEthernet130/0/2 up
+vpp# set int state TenGigabitEthernet130/0/3 up
+vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
+vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
+vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
+vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
+```
+
+Then, the L3 configuration looks like this:
+
+```
+vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
+vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
+vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
+vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
+vpp# set int state TenGigabitEthernet3/0/0 up
+vpp# set int state TenGigabitEthernet3/0/1 up
+vpp# set int state TenGigabitEthernet130/0/0 up
+vpp# set int state TenGigabitEthernet130/0/1 up
+vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
+vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
+vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
+vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
+vpp# ip route add 16.0.0.0/24 via 100.64.0.0
+vpp# ip route add 48.0.0.0/24 via 100.64.1.0
+vpp# ip route add 16.0.2.0/24 via 100.64.4.0
+vpp# ip route add 48.0.2.0/24 via 100.64.5.0
+vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
+vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
+vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
+vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
+```
+
+And finally, the Cisco T-Rex configuration looks like this:
+
+```
+- version: 2
+  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
+  port_info:
+    - src_mac:  00:1b:21:06:00:00
+      dest_mac: 9c:69:b4:61:a1:dc
+    - src_mac:  00:1b:21:06:00:01
+      dest_mac: 9c:69:b4:61:a1:dd
+
+    - src_mac:  00:1b:21:83:00:00
+      dest_mac: 00:1b:21:83:00:01
+    - src_mac:  00:1b:21:83:00:01
+      dest_mac: 00:1b:21:83:00:00
+
+    - src_mac:  00:1b:21:87:00:00
+      dest_mac: 9c:69:b4:61:75:d0
+    - src_mac:  00:1b:21:87:00:01
+      dest_mac: 9c:69:b4:61:75:d1
+
+    - src_mac:  9c:69:b4:85:00:00
+      dest_mac: 9c:69:b4:85:00:01
+    - src_mac:  9c:69:b4:85:00:01
+      dest_mac: 9c:69:b4:85:00:00
+```
+
+A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
+because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
+received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
+tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
+MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
+9c:69:b4:61:a1:dc`, which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
+wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
+
+With that said, I can start a baseline loadtest like so:
+{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
+
+T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
+are L2 xconnecting), using a packet size of 1514 bytes.
+This amounts to roughly 813Kpps per port, or a
+cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy
+to do the work.
+
+I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
+
+```
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
+```
+
+This should yield about 3'250 or so samples per second, and things look pretty great:
+
+```
+pim@hvn6-lab:~$ vppctl show err
+   Count               Node                         Reason                Severity
+ 5034508               sflow                sflow packets processed        error
+    4908               sflow                 sflow packets sampled         error
+ 5034508               sflow                sflow packets processed        error
+    5111               sflow                 sflow packets sampled         error
+ 5034516             l2-output                 L2 output packets           error
+ 5034516             l2-input                   L2 input packets           error
+ 5034404               sflow                sflow packets processed        error
+    4948               sflow                 sflow packets sampled         error
+ 5034404             l2-output                 L2 output packets           error
+ 5034404             l2-input                   L2 input packets           error
+ 5034404               sflow                sflow packets processed        error
+    4928               sflow                 sflow packets sampled         error
+ 5034404             l2-output                 L2 output packets           error
+ 5034404             l2-input                   L2 input packets           error
+ 5034516             l2-output                 L2 output packets           error
+ 5034516             l2-input                   L2 input packets           error
+```
+
+I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed`, which
+checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
+
+```
+pim@hvn6-lab:~$ sudo psampletest
+...
+pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+pstest: psample netlink (type=32) CMD = 0
+pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
+```
+
+And yes, all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes correctly
+correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this machine,
+each TenGig network interface has its own dedicated VPP worker thread. Considering I turned on
+sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
+
+```
+pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
+Name            State        Calls      Vectors   Suspends   Clocks   Vectors/Call
+sflow           active     3908218     14350684          0   9.05e1           3.67
+sflow           active     3913266     14350680          0   1.11e2           3.67
+sflow           active     3910828     14350687          0   1.08e2           3.67
+sflow           active     3909274     14350692          0   5.66e1           3.67
+```
+
+Alright, so for the 999 packets that went through and the one packet that got sampled, on average
+VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
+T-Rex.
+
+### VPP sFlow: Cost of passthru
+
+I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
+`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
+up a bizarrely high sampling rate, say one in ten million.
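+
+As a quick sanity check on the sample rates involved here, the back-of-the-envelope arithmetic for
+the 1:1000 test above and for the 1:10'000'000 rate I'm about to configure (plain shell arithmetic,
+using the per-port packet rates quoted in this article):
+
+```
+$ echo $(( 4 * 813000 / 1000 ))       # four ports at ~813Kpps, sampled 1:1000
+3252
+$ echo $(( 4 * 8370000 / 10000000 ))  # four ports at ~8.37Mpps, sampled 1:10'000'000
+3
+```
+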
+I'll also make the T-Rex loadtester use
+only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
+sending smaller packets, say 128 bytes:
+
+```
+tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
+
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
+
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
+```
+
+The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
+`sflow` plugin is not sampling many packets:
+
+```
+pim@hvn6-lab:~$ vppctl show err
+   Count               Node                         Reason                Severity
+  59777084              sflow                sflow packets processed        error
+         6              sflow                 sflow packets sampled         error
+  59777152            l2-output                 L2 output packets           error
+  59777152            l2-input                   L2 input packets           error
+  59777104              sflow                sflow packets processed        error
+         6              sflow                 sflow packets sampled         error
+  59777104            l2-output                 L2 output packets           error
+  59777104            l2-input                   L2 input packets           error
+
+pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
+Name            State        Calls       Vectors   Suspends   Clocks   Vectors/Call
+sflow           active     8186642     369674664          0   1.35e1          45.16
+sflow           active    25173660     369674696          0   1.97e1          14.68
+```
+Two observations:
+
+1. One of these is busier than the other. Without looking further, I can already predict that the
+top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
+dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
+time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
+because of that, it'll end up consuming more packets on each subsequent call, in order to catch up.
+The L2 path, on the other hand, is quicker and will therefore have fewer packets waiting on
+subsequent calls to `dpdk-input`.
+2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
+`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
+more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
+cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache
+over 15 or so packets each time it runs.
+
+I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
+works just fine:
+
+{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
+
+### VPP sFlow: Cost of sampling
+
+The other interesting case is to figure out how much CPU it takes to execute the code path
+with the actual sampling. This one turns out to be a bit trickier to measure.
+While leaving the previous
+loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
+1:10 packets:
+
+```
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
+pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
+```
+
+The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
+33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
+
+```
+pim@hvn6-lab:~$ vppctl show err | grep sflow
+ 340502528              sflow               sflow packets processed         error
+  12254462              sflow                sflow packets dropped          error
+  22611461              sflow                sflow packets sampled          error
+ 422527140              sflow               sflow packets processed         error
+   8533855              sflow                sflow packets dropped          error
+  34235952              sflow                sflow packets sampled          error
+```
+
+Ha, this new safeguard popped up: remember, all the way at the beginning I explained how there's a
+safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
+the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
+`sflow packets dropped`, and it accounts for roughly *half* of the samples on one interface, and a
+quarter on the other.
+
+My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under
+the current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning only 665Kpps
+per port, which is horrible, and it's still dropping samples.
+
+My second attempt is to turn off all ports but the last pair (the L2XC port with sFlow), which
+returns 930Kpps from the offered 1.5Mpps. VPP is clearly not having a good time here.
+
+Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
+ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
+port, without sFlow), and VPP forwards all 16Mpps and is happy again.
+
+Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
+regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
+delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
+
+#### Reasoning about the bottleneck
+
+But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer,
+I turn off all ports again, and ramp up the one port pair with L3 + sFlow at a 1:10 ratio to full
+line rate: that is, 64 byte packets at 14.88Mpps:
+
+```
+tui>stop
+tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
+```
+
+VPP is now on the struggle bus and is returning 3.17Mpps, or 21% of that. But I think it'll give me
+some reasonable data to try to feel out where the bottleneck is.
+
+```
+Thread 2 vpp_wk_1 (lcore 3)
+Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
+  vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
+             Name                 State       Calls      Vectors   Suspends   Clocks   Vectors/Call
+TenGigabitEthernet130/0/1-outp   active       77906     19943936          0   5.79e0         256.00
+TenGigabitEthernet130/0/1-tx     active       77906     19943936          0   6.88e1         256.00
+dpdk-input                       polling      77906     19943936          0   4.41e1         256.00
+ethernet-input                   active       77906     19943936          0   2.21e1         256.00
+ip4-input                        active       77906     19943936          0   2.05e1         256.00
+ip4-load-balance                 active       77906     19943936          0   1.07e1         256.00
+ip4-lookup                       active       77906     19943936          0   1.98e1         256.00
+ip4-rewrite                      active       77906     19943936          0   1.97e1         256.00
+sflow                            active       77906     19943936          0   6.14e1         256.00
+
+pim@hvn6-lab:pim# vppctl show err | grep sflow
+ 551357440              sflow               sflow packets processed         error
+  19829380              sflow                sflow packets dropped          error
+  36613544              sflow                sflow packets sampled          error
+```
+
+OK, the `sflow` plugin saw 551M packets and selected 36.6M of them for sampling, but it ultimately
+sent RPCs to the main thread for only 16.8M of them, dropping the other 19.8M. There are three code
+paths:
+
+1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
+through a packet.
+1. Very cheap: select the sample and construct the RPC, but toss it, costing Y CPU cycles.
+1. Expensive: select the sample, and send the RPC. Z CPU cycles in the worker, and another amount
+in main.
+
+Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
+into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
+assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
+guesstimate that it takes Y=20 CPU cycles.
+
+With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
+
+```
+AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
+
+61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
+61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
+33853346816 = 7443325440 + 732270880 + 16784164 * Z
+25677750496 = 16784164 * Z
+Z = 1529
+```
+
+Good to know! I find spending 1500 cycles to send the sample pretty reasonable, but for a dataplane
+that is trying to do 10Mpps per core, with a core running at 2.2GHz, there are really only 220 CPU
+cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
+feels dangerous to me.
+
+Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
+273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
+which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
+observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
+packets back out (same number: 3.1607e6).
+
+So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
+
+```
+pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
+    sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
+    rx missed                                    4065539464
+    rx missed                                    4182788310
+```
+
+In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. But I already know that it
+performed 3.16Mpps, and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
+for! It's just that DPDK never managed to read them from the hardware.
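+
+If you want to repeat this quick-and-dirty measurement, here it is wrapped in a small shell sketch.
+It assumes `vppctl` is in the PATH and that the counter shows up as `rx missed` in
+`show hardware-interfaces`, as it does above; the interface name is simply the one from this lab:
+
+```
+#!/usr/bin/env bash
+# Sketch: report the rx-miss rate (packets/sec) of one VPP interface over a 10 second window.
+IF="${1:-TenGigabitEthernet130/0/0}"
+get_missed() { vppctl show hardware-interfaces "$IF" | awk '/rx missed/ {print $3}'; }
+
+before=$(get_missed)
+sleep 10
+after=$(get_missed)
+echo "$IF: $(( (after - before) / 10 )) missed packets/sec"
+```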
+
+As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
+delivered:
+
+```
+Thread 2 vpp_wk_1 (lcore 3)
+Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
+  vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
+             Name                 State       Calls       Vectors   Suspends   Clocks   Vectors/Call
+TenGigabitEthernet130/0/1-outp   active      620012     158723072          0   5.66e0         256.00
+TenGigabitEthernet130/0/1-tx     active      620012     158723072          0   7.01e1         256.00
+dpdk-input                       polling     620012     158723072          0   4.39e1         256.00
+ethernet-input                   active      620012     158723072          0   1.56e1         256.00
+ip4-input-no-checksum            active      620012     158723072          0   1.43e1         256.00
+ip4-load-balance                 active      620012     158723072          0   1.11e1         256.00
+ip4-lookup                       active      620012     158723072          0   2.00e1         256.00
+ip4-rewrite                      active      620012     158723072          0   2.02e1         256.00
+```
+
+Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
+Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
+_enough_!"
+
+For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps), and saw that
+about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
+loadtest went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow threads
+are also impacted. I spent some time thinking about this and poking around, but I did not find a
+good explanation why port pairs 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here's a
+screenshot of VPP on the struggle bus:
+
+{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
+
+**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
+for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
+sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
+situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
+accounts for only 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
+cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
+though.
+
+## What's Next
+
+I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
+other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
+the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
+and packet lengths are all correct.
+
+I should do a few more tests with a debug binary and profiling turned on. I quickly ran `perf` over
+the VPP (release / optimized) binary running on the bench, but it merely showed that with sFlow
+enabled, roughly 80% of the time is spent in `libvlib`, against about 54% in the baseline (sFlow
+turned off):
+
+```
+root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
+root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
+# Overhead  Shared Object (sFlow)     Overhead  Shared Object (baseline)
+# ........  ......................    ........  ........................
+#
+  79.02%  libvlib.so.24.10            54.27%  libvlib.so.24.10
+  12.82%  libvnet.so.24.10            33.91%  libvnet.so.24.10
+   3.77%  dpdk_plugin.so              10.87%  dpdk_plugin.so
+   3.21%  [kernel.kallsyms]            0.81%  [kernel.kallsyms]
+   0.29%  sflow_plugin.so              0.09%  ld-linux-x86-64.so.2
+   0.28%  libvppinfra.so.24.10         0.03%  libc.so.6
+   0.21%  libc.so.6                    0.01%  libvppinfra.so.24.10
+   0.17%  libvlibapi.so.24.10          0.00%  libvlibmemory.so.24.10
+   0.15%  libvlibmemory.so.24.10
+   0.07%  ld-linux-x86-64.so.2
+   0.00%  vpp
+   0.00%  [vdso]
+   0.00%  libsvm.so.24.10
+```
+
+Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
+to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
+interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
+reading interface counters from the dataplane (or from the Prometheus Exporter), and most
+importantly, ensuring it works well, or fails gracefully, under stringent load.
+
+## Acknowledgements
+
+The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
+thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
+and Peter Phaal for facilitating a get-together last year.
+
+Who's up for making this thing a reality?!
diff --git a/static/assets/sflow/sflow.gif b/static/assets/sflow/sflow.gif
new file mode 100644
index 0000000..59a0723
--- /dev/null
+++ b/static/assets/sflow/sflow.gif
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1f0c58462a399d3c3a5e79ac70997207e173343f06b6976b093427527f3989ee
+size 2128
diff --git a/static/assets/sflow/trex-baseline.png b/static/assets/sflow/trex-baseline.png
new file mode 100644
index 0000000..462cb88
--- /dev/null
+++ b/static/assets/sflow/trex-baseline.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0c1e0c5576c211e08392a03292cd9f37b5c6759f9a4e3e1e195f3678c8a90265
+size 1113146
diff --git a/static/assets/sflow/trex-overload.png b/static/assets/sflow/trex-overload.png
new file mode 100644
index 0000000..48b75ce
--- /dev/null
+++ b/static/assets/sflow/trex-overload.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:efd01ff996582e01759d7a61fc99169949198d6925992dd5a74d6ca919910432
+size 1124291
diff --git a/static/assets/sflow/trex-passthru.png b/static/assets/sflow/trex-passthru.png
new file mode 100644
index 0000000..fb76040
--- /dev/null
+++ b/static/assets/sflow/trex-passthru.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8a791772479d07324e17d08b1e96c6f9855045930415219606f87446edb491b8
+size 1094333