A few typo fixes and clarifications
In the months that followed, I discussed the feature with the incredible folks at
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)],
but for lack of time I didn't manage to get it to work, which was largely my fault, by the way.

However, I have a bit of time on my hands in September and October, and just a few weeks ago,
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
communities, and what-not). Finally, the Agent will periodically read the octet and packet counters
of the physical network interface(s). Ultimately, the Agent will send the samples and additional
information over the network as a UDP datagram, to an **sFlow Collector** for further processing.
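
To keep the moving parts straight in my head, here's a rough sketch of what one flow sample
carries. This is my own simplified paraphrase of an sFlow v5 flow sample, not the exact wire
format, and the field names are illustrative only:

```
#include <stdint.h>

/* Simplified view of one sFlow flow sample (illustrative, not the v5 wire format). */
typedef struct {
  uint32_t sequence_number; /* incremented for every sample this Agent takes       */
  uint32_t input_if;        /* ifIndex of the ingress physical interface           */
  uint32_t output_if;       /* ifIndex of the egress interface, if known           */
  uint32_t sampling_rate;   /* the 1:N rate that was in effect                     */
  uint32_t sample_pool;     /* total packets that could have been sampled          */
  uint32_t drops;           /* samples dropped, e.g. when the Agent was overloaded */
  uint32_t header_length;   /* how many bytes of the packet were copied (M)        */
  uint8_t  header[128];     /* the first M bytes (typically 128) of the packet     */
  /* Optionally: forwarding state such as source/destination prefix, MPLS LSP
   * information and BGP communities. Interface octet and packet counters are
   * sent separately, as periodic counter samples. */
} flow_sample_t;
```
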

sFlow has been specifically designed to take advantage of the statistical properties of packet
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
monitoring system will always produce statistically quantifiable measurements. You can read more
about it in Peter Phaal and Sonia Panchen's
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
little bit at the math :)
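
For intuition on what 'statistically quantifiable' means in practice, here's a small sketch of my
own (not code from the paper): scale the sample count back up by N, and apply the roughly
`196 * sqrt(1/c)` percent error bound at 95% confidence that the paper derives for `c` samples:

```
#include <math.h>
#include <stdio.h>

/* Estimate how many packets of some class passed, given how many were sampled.
 * samples:    number of sFlow samples that matched the class (c)
 * sampling_n: the 1:N sampling rate configured on the Agent
 * pct_error:  receives the ~95% confidence bound, 196 * sqrt(1/c)             */
static double estimate_packets(unsigned long samples, unsigned sampling_n,
                               double *pct_error) {
  *pct_error = samples ? 196.0 * sqrt(1.0 / (double)samples) : 100.0;
  return (double)samples * (double)sampling_n;
}

int main(void) {
  double err;
  /* Example: 1:1000 sampling yielded 2500 samples for a given flow. */
  double est = estimate_packets(2500, 1000, &err);
  printf("estimated %.0f packets, within +/- %.1f%%\n", est, err); /* ~3.9% */
  return 0;
}
```

The pleasant property is that the error bound depends only on how many samples were collected, not
on the sampling rate or the total traffic volume.
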
### sFlow: Netlink PSAMPLE

The nice thing, for me anyway, is that I have a little bit of experience with Netlink due to the
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.

The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
from the physical network interfaces, and producing Netlink messages. Then, some other program,
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
somewhere else on the network.

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}

There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a
kernel module that needs to be loaded: `modprobe psample` and `psampletest` helpfully does that for
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out
hair.
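
For the curious, here is roughly what such a PSAMPLE subscriber boils down to. It's a minimal
sketch of my own using libnl-genl-3 and the attribute names from `linux/psample.h`, assuming the
`psample` module is loaded; `psampletest` itself does considerably more (and does it more
carefully):

```
#include <linux/psample.h>
#include <netlink/attr.h>
#include <netlink/genl/ctrl.h>
#include <netlink/genl/genl.h>
#include <netlink/msg.h>
#include <netlink/netlink.h>
#include <stdio.h>

/* Print a one-line summary for every PSAMPLE message we receive. */
static int on_sample(struct nl_msg *msg, void *arg) {
  struct nlattr *attrs[PSAMPLE_ATTR_MAX + 1];
  genlmsg_parse(nlmsg_hdr(msg), 0, attrs, PSAMPLE_ATTR_MAX, NULL);

  if (attrs[PSAMPLE_ATTR_IIFINDEX] && attrs[PSAMPLE_ATTR_SAMPLE_RATE])
    printf("in=%u rate=1:%u origsize=%u\n",
           nla_get_u16(attrs[PSAMPLE_ATTR_IIFINDEX]),
           nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_RATE]),
           attrs[PSAMPLE_ATTR_ORIGSIZE] ? nla_get_u32(attrs[PSAMPLE_ATTR_ORIGSIZE]) : 0);
  return NL_OK;
}

int main(void) {
  struct nl_sock *sk = nl_socket_alloc();
  genl_connect(sk);

  /* Join the "packets" multicast group of the "psample" genetlink family. */
  int grp = genl_ctrl_resolve_grp(sk, PSAMPLE_GENL_NAME, PSAMPLE_NL_MCGRP_SAMPLE_NAME);
  if (grp < 0) {
    fprintf(stderr, "is the psample kernel module loaded?\n");
    return 1;
  }
  nl_socket_disable_seq_check(sk);
  nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, on_sample, NULL);
  nl_socket_add_membership(sk, grp);

  while (1)
    nl_recvmsgs_default(sk); /* blocks, invoking on_sample() per message */
}
```

Link it against libnl3, i.e. `-lnl-3 -lnl-genl-3`, and run it in the same network namespace as the
sampler.
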
## VPP: sFlow Plugin

This plugin will create a new VPP _graph node_ called `sflow`, which the operator can enable on
`device-input`, in other words, if enabled, the plugin will get a copy of all packets that are read
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
`ethernet-input`. Almost all of the interesting action is in `node.c`.

The kicker is that one in N packets will be selected for sampling, after which (a simplified sketch
follows the list):
1. the ethernet header (`*en`) is extracted from the packet
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
with physical network interfaces!
1. if there are too many samples from this worker thread being worked on, it is discarded and an
error counter is incremented. This protects the main thread from being slammed with samples if
there are simply too many being fished out of the dataplane.
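
Here's a simplified sketch of that per-packet decision, in the spirit of the plugin but not its
actual code. The names are mine, the real logic lives in `node.c` inside VPP's vectorized node
dispatch, and the sFlow spec additionally wants the skip count randomized:

```
#include <stdbool.h>
#include <stdint.h>

/* Per-worker sampling state (illustrative; the real logic lives in node.c). */
typedef struct {
  uint32_t sampling_N;      /* take one sample every N packets                */
  uint32_t skip;            /* packets left until the next sample             */
  uint32_t outstanding;     /* samples handed to main, not yet processed      */
  uint32_t max_outstanding; /* back-pressure limit protecting the main thread */
  uint64_t dropped;         /* error counter: samples thrown away             */
} sflow_per_worker_t;

/* Returns true if this packet should become a sample; false means the packet
 * simply moves on to the next node, typically ethernet-input.               */
static bool sflow_take_sample(sflow_per_worker_t *w) {
  if (w->skip > 1) {
    w->skip--;
    return false;
  }
  w->skip = w->sampling_N;  /* re-arm the 1:N selection                       */

  if (w->outstanding >= w->max_outstanding) {
    w->dropped++;           /* too many samples in flight: discard this one   */
    return false;
  }
  w->outstanding++;         /* copy the ethernet header and hw_if_index, then
                             * hand the sample off to the main thread         */
  return true;
}
```
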

Some observations:

First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
will not be an issue to change.
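
If and when the rate becomes per-interface, the bookkeeping could be as simple as the sketch below.
This is purely my illustration of the idea, not code from the plugin, and the names are
hypothetical:

```
#include <stdint.h>

#define SFLOW_N_DISABLED 0 /* 0 = no sampling on this interface */

/* Hypothetical per-interface sampling table, indexed by VPP's hw_if_index. */
typedef struct {
  uint32_t *sampling_N; /* the real plugin would likely use a VPP vec_* here */
  uint32_t  len;
} sflow_config_t;

static inline uint32_t
sflow_sampling_N (const sflow_config_t *cfg, uint32_t hw_if_index)
{
  if (hw_if_index >= cfg->len)
    return SFLOW_N_DISABLED;
  /* e.g. 1:1000 on the 1G/10G ports, 1:10000 on the 100G ports */
  return cfg->sampling_N[hw_if_index];
}
```
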
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
samples are sent from many threads, there will be lock contention and performance will suffer.
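
For context, this is roughly the shape of that hand-off as I understand it. The function and its
signature come from VPP's `vlibmemory` API; the surrounding handler is my own illustrative sketch,
not the plugin's actual code:

```
#include <vlibmemory/api.h> /* declares vl_api_rpc_call_main_thread() */

/* Runs on the main thread later; 'arg' points at a copy of the data below. */
static void
sflow_sample_rpc_cb (void *arg)
{
  /* ...append the sample to the Agent's queue, emit it via PSAMPLE, ...    */
}

/* Called from a worker thread for every selected packet.                   */
static void
sflow_send_to_main (u8 *sample, u32 len)
{
  /* The RPC machinery copies 'sample' and queues it for main; that shared
   * queue is what the spinlock in memclnt_api.c protects, hence the worry
   * about contention when many workers sample aggressively.                */
  vl_api_rpc_call_main_thread (sflow_sample_rpc_cb, sample, len);
}
```
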
### sFlow Plugin: Functional

Thus, the `psampletest` and other tools must also run in that namespace. With Linux CP, oftentimes
the controlplane interfaces are created in a dedicated `dataplane` network namespace.

***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
hdrlen set to 70 there? I'm going to want to ask Neil about that.
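
The buffer chaining is exactly where a naive length calculation goes wrong: a single
`vlib_buffer_t` only knows about its own bytes. A sketch of the distinction, using VPP's buffer
fields (my example, not the plugin's code):

```
#include <vlib/vlib.h>

/* For a 9000-byte jumboframe in 2048-byte buffers, b->current_length only
 * covers the first buffer; the full packet length has to walk the chain.   */
static u32
packet_length (vlib_main_t *vm, vlib_buffer_t *b)
{
  u32 len = b->current_length;            /* bytes in this buffer only      */
  while (b->flags & VLIB_BUFFER_NEXT_PRESENT)
    {
      b = vlib_get_buffer (vm, b->next_buffer);
      len += b->current_length;           /* add each chained buffer        */
    }
  return len; /* equivalent to vlib_buffer_length_in_chain (vm, b) */
}
```
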
***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed

***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
in Linux:
```
root@vpp0-2:~# ip link
```
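
To make the mismatch concrete, here's a tiny standalone check of what the kernel thinks those index
numbers mean; `if_indextoname()` is plain POSIX and knows nothing about VPP's numbering:

```
#include <net/if.h>
#include <stdio.h>

int main(void) {
  char name[IF_NAMESIZE];
  /* VPP's sFlow samples said in=1 and in=2; ask Linux what *it* calls those. */
  for (unsigned idx = 1; idx <= 2; idx++)
    printf("Linux ifIndex %u = %s\n", idx,
           if_indextoname(idx, name) ? name : "(no such interface)");
  return 0; /* here: 1 = lo, 2 = enp1s0 -- not VPP's Gi10/0/0 and Gi10/0/1 */
}
```
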
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
normally use `/proc/net/` entries for that, but there are two problems with this (a sketch of such
a counter read follows the list):
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
interface, or `linux-cp` is not used at all.

1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the
Linux representation will only count packets and octets of _punted_ packets, that is to say, the
stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device. Important
to note that east-west traffic that goes _through_ the dataplane, is never punted to Linux, and as
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
router.
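
For reference, the 'normal' host-style counter poll that an sFlow Agent would do looks roughly like
the sketch below (my own illustration). On a VPP router these `/proc/net/dev` numbers only ever
reflect punted traffic, which is exactly the undercounting described above:

```
#include <stdio.h>
#include <string.h>

/* Read rx byte and packet counters for one interface from /proc/net/dev.
 * Returns 0 on success, -1 if the interface is not known to Linux at all
 * (problem 1 above); even on success, on a VPP router the numbers only
 * reflect punted traffic (problem 2 above).                                */
static int read_linux_counters(const char *ifname, unsigned long long *rx_bytes,
                               unsigned long long *rx_pkts) {
  FILE *f = fopen("/proc/net/dev", "r");
  char line[512];
  if (!f)
    return -1;
  while (fgets(line, sizeof(line), f)) {
    char name[64];
    unsigned long long bytes, pkts;
    /* Per-interface lines look like: "  eth0: bytes packets errs drop ..." */
    if (sscanf(line, " %63[^:]: %llu %llu", name, &bytes, &pkts) == 3 &&
        strcmp(name, ifname) == 0) {
      *rx_bytes = bytes;
      *rx_pkts = pkts;
      fclose(f);
      return 0;
    }
  }
  fclose(f);
  return -1;
}
```
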
### VPP sFlow: Performance

```
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
```

I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:

```
pim@hvn6-lab:~$ vppctl show run | grep -e '(Name|sflow)'
```

I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
sending smaller packets, say 128 bytes:

```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
```

The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
`sFlow` plugin is not sampling many packets:
```
```

Two observations:

top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
up. The L2 path on the other hand, is quicker and therefore will have fewer packets waiting on
subsequent iterations of `dpdk-input`.

2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction

```
tui>stop
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
```

VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
some reasonable data to try to feel out where the bottleneck is.
```
pim@hvn6-lab:pim# vppctl show err | grep sflow
```
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
code paths, each one extending the other:

1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
through a packet.

```
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
Z = 1529
```

Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
dataplane that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220
CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
feels dangerous to me.

Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
packets back out (same number: 3.1607e6).

So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:

```
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
rx missed 4182788310
```

In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
forwarded 3.16Mpps and you know what? 11.72 + 3.16 is precisely 14.88Mpps. All packets are accounted
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
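
Just to double-check that bookkeeping, here's the arithmetic in one place, with the numbers copied
from the measurements above:

```
#include <stdio.h>

int main(void) {
  /* Cycle budget: a 2.2GHz core spending ~273 cycles/packet tops out at:    */
  double max_pps = 2.2e9 / 273.0;                              /* ~8.06 Mpps */

  /* Packet accounting over the ten-second interval:                         */
  double missed_pps    = (4182788310.0 - 4065539464.0) / 10.0; /* rx missed  */
  double forwarded_pps = 3.16e6;                               /* vector rates in */

  printf("theoretical per-thread max: %.2f Mpps\n", max_pps / 1e6);
  printf("missed %.2f + forwarded %.2f = %.2f Mpps (line rate is 14.88)\n",
         missed_pps / 1e6, forwarded_pps / 1e6,
         (missed_pps + forwarded_pps) / 1e6);
  return 0;
}
```
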
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were

Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
_enough_!"

For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
loadtest, went up slightly to 3.44Mpps now. What puzzles me even more, is that the non-sFlow worker
threads are also impacted. I spent some time thinking about this and poking around, but I did not
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
Here's a screenshot of VPP on the struggle bus:

{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}