A few typo fixes and clarifications

2024-09-09 11:00:22 +02:00
parent 08d55e6ac0
commit 1379c77181


@ -21,7 +21,7 @@ one in a thousand or ten thousand packet headers with flow _sampling_ not be jus
In the months that followed, I discussed the feature with the incredible folks at
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)]
but for lack of time I didn't manage to get it to work.
but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
However, I have a bit of time on my hands in September and October, and just a few weeks ago,
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
@ -37,16 +37,17 @@ N datagrams (typically 1:1000 or 1:10000) going through the physical network int
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
forwarding information (such as router source- and destination prefix, and what-not, MPLS LSP
information, BGP communties, and what-not). Finally the Agent will will periodically read the octet
and packet counters of the physical network interface. Ultimately, the Agent will send the samples
and additional information to an **sFlow Collector** for further processing.
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
communities, and what-not). Finally, the Agent will periodically read the octet and packet counters of
the physical network interface(s). Ultimately, the Agent will send the samples and additional
information over the network as UDP datagrams to an **sFlow Collector** for further processing.
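To make that a bit more tangible, here's a rough and simplified C rendering of what one flow sample
carries, per my reading of the sFlow v5 spec. The real thing is XDR-encoded and has more record
types, and the field names here are approximate rather than lifted from any implementation:

```c
/* Rough, simplified rendering of one sFlow v5 flow sample, based on my
 * reading of the spec -- the real encoding is XDR and carries a list of
 * typed records; field names here are approximate. */
#include <stdint.h>

#define SFLOW_MAX_HEADER 128 /* the "first M bytes" mentioned above */

typedef struct {
  uint32_t sequence_number; /* incremented per sample taken by this source */
  uint32_t source_id;       /* the sampling data source, e.g. an ifIndex */
  uint32_t sampling_rate;   /* the 1:N of the sampling process */
  uint32_t sample_pool;     /* total packets that could have been sampled */
  uint32_t drops;           /* samples dropped due to lack of resources */
  uint32_t input_if;        /* ingress interface */
  uint32_t output_if;       /* egress interface, if known */

  /* One of possibly several flow records: the raw packet header. */
  uint32_t frame_length;    /* original length of the sampled packet */
  uint32_t header_length;   /* how many bytes were actually copied */
  uint8_t  header[SFLOW_MAX_HEADER];
} sflow_flow_sample_t;
```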
sFlow has been specifically designed to take advantage of the statistical properties of packet
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
monitoring system will always produce statistically quantifiable measurements. You can read more
about it in Peter Phaal and Sonia Panchen's
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did :)
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
little bit at the math :)
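As a quick back-of-the-envelope illustration of that statistical claim, here's my own sketch using
the error bound I remember from the paper, roughly `196*sqrt(1/c)` percent at 95% confidence for `c`
samples of a given traffic class; the sample counts below are illustrative, not from the paper:

```c
/* Scaling samples back up to an estimate, plus the rule-of-thumb error bound
 * (95% confidence) from the packet sampling paper, as I recall it. */
#include <math.h>
#include <stdio.h>

int main(void) {
  unsigned long samples[] = {100, 1000, 10000, 100000};
  unsigned long sampling_rate = 1000; /* 1:N sampling */
  for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
    double c = (double)samples[i];
    double pct_error = 196.0 * sqrt(1.0 / c);          /* error bound in % */
    double estimated_pkts = c * (double)sampling_rate; /* scale back up */
    printf("%8lu samples -> ~%.0f packets, error <= %.1f%%\n",
           samples[i], estimated_pkts, pct_error);
  }
  return 0;
}
```

At 1:1000, a class that produced 10'000 samples is known to within about 2%, which is plenty for
traffic engineering purposes.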
### sFlow: Netlink PSAMPLE
@ -61,18 +62,20 @@ thing, for me anyway, is that I have a little bit of experience with Netlink due
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
from the network interfaces, and generating Netlink messages. Then, some other program, notably
outside of VPP, can subscribe to these messages and further handle them, creating UDP packets with
sFlow samples and counters and other information, and sending them to an **sFlow Collector**
from the physical network interfaces, and producing Netlink messages. Then, some other program,
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
somewhere else on the network.
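To make the consumer side concrete, here's a minimal sketch of how a program outside of VPP could
subscribe to those PSAMPLE messages. It assumes libnl-3/libnl-genl-3 (psampletest itself talks
netlink more directly); the generic netlink family `psample` and its multicast group `packets` come
from the kernel's psample UAPI:

```c
/* Minimal PSAMPLE subscriber sketch, assuming libnl-3 + libnl-genl-3.
 * Compile: gcc -o psub psub.c $(pkg-config --cflags --libs libnl-genl-3.0) */
#include <stdio.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

static int sample_cb(struct nl_msg *msg, void *arg) {
  /* A real consumer would genlmsg_parse() the PSAMPLE_ATTR_* attributes here
   * (ingress ifindex, sample rate, original size, packet data). */
  printf("received a PSAMPLE message of %d bytes\n", nlmsg_datalen(nlmsg_hdr(msg)));
  return NL_OK;
}

int main(void) {
  struct nl_sock *sk = nl_socket_alloc();
  genl_connect(sk);
  nl_socket_disable_seq_check(sk); /* multicast events are not sequenced */

  /* Resolve the "psample" generic netlink family and its "packets" group. */
  int grp = genl_ctrl_resolve_grp(sk, "psample", "packets");
  if (grp < 0) {
    fprintf(stderr, "is the psample kernel module loaded?\n");
    return 1;
  }
  nl_socket_add_membership(sk, grp);
  nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, sample_cb, NULL);

  for (;;)
    nl_recvmsgs_default(sk); /* block and dispatch samples to sample_cb() */
}
```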
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
this stuff, I kept on getting errors. It turns out, there's a kernel module that needs to be loaded:
`modprobe psample` and make sure it's in `/etc/modules` before you spend as many hours as I did
pulling out hair.
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a
kernel module that needs to be loaded (`modprobe psample`). `psampletest` helpfully does that for
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling
out hair.
## VPP: sFlow Plugin
@ -87,11 +90,12 @@ plugin will create a new VPP _graph node_ called `sflow`, which the operator can
`device-input`, in other words, if enabled, the plugin will get a copy of all packets that are read
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
`ethernet-input`.
`ethernet-input`. Almost all of the interesting action is in `node.c`.
The kicker is that one in N packets will be selected for sampling, after which (see the sketch after this list):
1. the ethernet header (`*en`) is extracted from the packet
1. the input interface (`hw_if_index`) is extracted from the VPP buffer
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
with physical network interfaces!
1. if there are too many samples from this worker thread still being worked on, the new sample is
discarded and an error counter is incremented. This protects the main thread from being slammed
with samples if there are simply too many being fished out of the dataplane.
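To visualize what that boils down to per packet, here's an illustrative sketch; this is not the
plugin's actual `node.c`, all the names in it are invented for the example, and a real
implementation would randomize the skip interval rather than counting strictly 1-in-N:

```c
/* Illustrative sketch of the per-packet decision described in the list above.
 * sflow_worker_t, sflow_sample_t, sflow_take_sample and the constants are all
 * made up for this example. */
#include <stdint.h>
#include <string.h>

#define SFLOW_HEADER_BYTES 128
#define SFLOW_MAX_PENDING  256

typedef struct {
  uint32_t sampling_N; /* sample 1 out of every N packets */
  uint64_t count;      /* packets seen by this worker */
  uint32_t pending;    /* samples still in flight towards the main thread */
  uint64_t dropped;    /* samples discarded to protect the main thread */
} sflow_worker_t;

typedef struct {
  uint32_t hw_if_index;                /* physical input interface */
  uint32_t length;                     /* bytes copied into header[] */
  uint8_t  header[SFLOW_HEADER_BYTES]; /* start of the sampled packet */
} sflow_sample_t;

/* Returns 1 if a sample was taken, 0 if the packet simply passes through
 * to the next node (ethernet-input). */
static int sflow_take_sample(sflow_worker_t *wk, uint32_t hw_if_index,
                             const uint8_t *pkt, uint32_t pkt_len,
                             sflow_sample_t *out) {
  if (++wk->count % wk->sampling_N != 0)
    return 0; /* fast path: the vast majority of packets end up here */

  if (wk->pending >= SFLOW_MAX_PENDING) {
    wk->dropped++; /* too many samples outstanding: drop, bump a counter */
    return 0;
  }
  out->hw_if_index = hw_if_index;
  out->length = pkt_len < SFLOW_HEADER_BYTES ? pkt_len : SFLOW_HEADER_BYTES;
  memcpy(out->header, pkt, out->length);
  wk->pending++;
  return 1;
}
```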
@ -110,13 +114,15 @@ Some observations:
First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
100G network cards available. This is not going to be an issue to retrofit.
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
will not be an issue to change.
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
samples are sent from many threads, there will be lock contention and performance will nose dive.
samples are sent from many threads, there will be lock contention and performance will suffer.
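For illustration, this is roughly the hand-off pattern being described, sketched under the
assumption that `vl_api_rpc_call_main_thread()` has the signature I remember; the payload struct
and callback names are mine, not the plugin's:

```c
/* Hedged sketch of the worker-to-main hand-off, not the plugin's actual code.
 * Only vl_api_rpc_call_main_thread() is the real API referenced above. */
#include <vlib/vlib.h>
#include <vlibmemory/api.h>

typedef struct {
  u32 hw_if_index; /* physical input interface the sample came from */
  u32 sampling_N;  /* effective sampling rate at the time of capture */
  u16 header_len;  /* number of valid bytes in header[] */
  u8  header[128]; /* start of the sampled packet */
} sflow_rpc_sample_t;

/* Runs on the main thread, outside the forwarding hot path; this is where
 * the PSAMPLE netlink message would be built and sent. */
static void sflow_sample_rpc_cb(sflow_rpc_sample_t *s) { /* ... */ }

static void sflow_send_sample_to_main(sflow_rpc_sample_t *s) {
  /* Internally this call serializes the payload and takes the spinlock that
   * the paragraph above worries about when many workers sample at a high rate. */
  vl_api_rpc_call_main_thread((void *)sflow_sample_rpc_cb, (u8 *)s, sizeof(*s));
}
```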
### sFlow Plugin: Functional
@ -205,11 +211,11 @@ Thus, the `psampletest` and other tools must also run in that namespace. I menti
Linux CP, often times the controlplane interfaces are created in a dedicated `dataplane` network
namespace.
***2. pktlen and hdrlen***: The pktlen is off, and this is a bug. In VPP, packets are put into buffers
of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the
same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2` packet
with length 66, that looks like a legitimate ACK packet on the way back. Byt ehy is the hdrlen set
to 70 there? I'm going to want to ask Neil about that.
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
hdrlen set to 70 there? I'm going to want to ask Neil about that.
***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
@ -237,11 +243,11 @@ GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
not even guaranteed that there even _exists_ an interface in Linux for the PHY upon which the
sampling is happening. Said differently, `in=1` here is meant to reference VPP's
`GigabitEthernet10/0/0` interface, but in Linux, `ifIndex=1` is a completely different interface
(`lo`) in the default network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds
to interface `enp1s0` in Linux:
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
in Linux:
```
root@vpp0-2:~# ip link
@ -253,16 +259,18 @@ root@vpp0-2:~# ip link
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
normally use `/proc/net/` entries for that, but there are two problems with this:
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
interface.
1. Even if it does exist (and it's the "correct" ifIndex in Linux), the statistics counters in the
Linux representation will only count packets and octets of _punted_ packets, that is to say,
the stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device.
Important to note that east-west traffic that goes _through_ the dataplane, is never punted to
Linux, and as such, the counters will be undershooting: only counting traffic _to_ the router,
not _through_ the router.
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
interface, or `linux-cp` is not used at all.
1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
Interface Pair_'s tuntap `host_vif_index` is used, even then the statistics counters in the Linux
representation will only count packets and octets of _punted_ packets, that is to say, the stuff
that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device. It's important
to note that east-west traffic that goes _through_ the dataplane is never punted to Linux, and as
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
router. A sketch of the `/proc/net/` polling mentioned above follows after this list.
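Here is that `/proc/net/` polling sketched in plain C (generic Linux, not any sFlow agent's actual
code): pointed at a `linux-cp` host interface, the counters it returns only ever reflect punted
traffic, which is exactly the undershoot described above:

```c
/* Read rx bytes/packets for one interface from /proc/net/dev. For a linux-cp
 * tuntap these counters only cover punted traffic, never the packets that
 * stay inside the VPP dataplane. */
#include <stdio.h>
#include <string.h>

int read_linux_counters(const char *ifname, unsigned long long *rx_bytes,
                        unsigned long long *rx_pkts) {
  FILE *f = fopen("/proc/net/dev", "r");
  char line[512];
  if (!f) return -1;
  while (fgets(line, sizeof(line), f)) {
    char name[64];
    unsigned long long rb, rp;
    /* data lines look like: "  eth0: rx_bytes rx_packets errs drop ..." */
    if (sscanf(line, " %63[^:]: %llu %llu", name, &rb, &rp) == 3 &&
        strcmp(name, ifname) == 0) {
      *rx_bytes = rb;
      *rx_pkts = rp;
      fclose(f);
      return 0;
    }
  }
  fclose(f);
  return -1;
}
```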
### VPP sFlow: Performance
@ -405,10 +413,10 @@ pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
```
and yes, all four `in` interfaced (8, 9, 10 and 11) are sending samples, and those indexes correctly
correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this machine,
each TenGig network interface has its own dedicated VPP worker thread. Considering I turned on
sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
```
pim@hvn6-lab:~$ vppctl show run | grep -e '(Name|sflow)'
@ -427,7 +435,7 @@ T-Rex.
I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
up a hizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
sending smaller packets, say 128 bytes:
@ -443,7 +451,7 @@ pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
```
The laodtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
`sFlow` plugin is not sampling many packets:
```
@ -469,9 +477,10 @@ Two observations:
top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
because of that, it'll end up consuming more packets on each subsequent call, in order to catch up.
The L2 path on the other hand, is quicker and therefore will have less packets waiting on subsequent
calls to `dpdk-input`.
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
up. The L2 path, on the other hand, is quicker and therefore will have fewer packets waiting on
subsequent iterations of `dpdk-input`.
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
@ -541,7 +550,7 @@ tui>stop
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
```
VPP is now on the struggle bus and is returning 3.17Mpps or 21% of that. But, I think it'll give me
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
some reasonable data to try to feel out where the bottleneck is.
```
@ -566,7 +575,8 @@ pim@hvn6-lab:pim# vppctl show err | grep sflow
```
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
sent RPCs to the main thread for 19.8M of them. There are three code paths:
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
code paths, each one extending the other:
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
through a packet.
@ -590,15 +600,15 @@ AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
Z = 1529
```
Good to know! I find spending 1500 cycles to send the sample pretty reasonable, but for a dataplane
that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220 CPU
cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
dataplane that is trying to do 10Mpps per core on a core running at 2.2GHz, there are really only
220 CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
feels dangerous to me.
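Putting numbers to that worry, using only figures from this test (X=13.5 and Z=1529 cycles, 16.8M
RPCs for 551M packets, a 220-cycle budget at 10Mpps on 2.2GHz) and leaving Y out since it only
touches a small fraction of packets:

```c
/* Amortized sFlow overhead per forwarded packet: every packet pays the
 * pass-through cost X, and a fraction of packets additionally pays the RPC
 * cost Z. Y (sampled but not sent) is ignored here for simplicity. */
#include <stdio.h>

static double amortized(double X, double Z, double rpc_fraction) {
  return X + Z * rpc_fraction; /* average extra cycles per packet */
}

int main(void) {
  double X = 13.5, Z = 1529.0;    /* measured pass-through and RPC costs */
  double budget = 2.2e9 / 10.0e6; /* 220 cycles/packet for 10Mpps at 2.2GHz */

  printf("budget: %.0f cycles/packet\n", budget);
  printf("this test (16.8M/551M RPCs): %.1f cycles/packet overhead\n",
         amortized(X, Z, 16.8e6 / 551e6));
  printf("1:1000 sampling:             %.1f cycles/packet overhead\n",
         amortized(X, Z, 1.0 / 1000));
  return 0;
}
```

At a more realistic 1:1000 sampling rate the amortized overhead works out to roughly 15 cycles per
packet, comfortably inside the budget; at this stress test's effective rate it is closer to 60.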
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
observe (3.16Mpps)! But, for all the `vector rates in` (3.1605e6), it also managed to emit the
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
packets back out (same number: 3.1607e6).
So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
@ -610,9 +620,9 @@ pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
rx missed 4182788310
```
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. But I already know that it
performed 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
for! It's just, DPDK never managed to read them from the hardware.
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
forwarded 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
@ -637,12 +647,12 @@ Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
_enough_!"
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps), and saw that
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
loadtest, went up slighty to 3.44Mpps now. What puzzles me even more, is that the non-sFlow threads
are also impacted. I spent some time thinking about this and poking around, but I did not find a
good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here's a
screenshot of VPP on the struggle bus:
loadtest, went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker
threads are also impacted. I spent some time thinking about this and poking around, but I did not
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
Here's a screenshot of VPP on the struggle bus:
{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}