---
date: "2025-02-08T07:51:23Z"
title: 'VPP with sFlow - Part 3'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}

In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
use of networks, enabling performance optimization, accounting/billing for usage, and defense
against security threats.

The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
central _Collector_.

In this third article, I wanted to spend some time discussing how samples make their way out of the
VPP dataplane, and into higher level tools.

## Recap: sFlow

{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}

sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to the local sFlow Agent.

**Samples**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After the
egress port(s) are found in an L2 FIB, or a next hop (and port) is found after a routing decision,
the ASIC can annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.

**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.

**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
destination of the packet, what additional RIB information do we have (AS path, BGP communities, and
so on). This may be added to the sample record as well.

**Counters**: Since we're doing sampling of 1:N packets, we can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if we
sample a fraction of the traffic but know how many bytes and packets we saw in total, we can provide
an overview with a quantifiable accuracy. This is why the Agent will periodically get the interface
counters from the ASIC.

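To make that "quantifiable accuracy" a little more concrete, here's a back-of-the-envelope sketch.
The error formula is the rule of thumb from the paper linked above; the traffic numbers are made up
purely for illustration:

```python
import math

SAMPLING_N = 1000          # 1-in-N sampling rate configured on the device
samples_seen = 33_800      # packet samples received by the collector
sampled_bytes = 4.2e9      # sum of the frame lengths carried in those samples

# Scale the sampled numbers back up to an estimate of the total traffic.
est_packets = samples_seen * SAMPLING_N
est_bytes = sampled_bytes * SAMPLING_N

# Rule of thumb from the packet sampling paper: at 95% confidence the
# relative error is roughly 196% * sqrt(1 / number_of_samples).
error_pct = 196.0 * math.sqrt(1.0 / samples_seen)

print(f"~{est_packets:.3g} packets, ~{est_bytes:.3g} bytes, +/- {error_pct:.2f}%")
```
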
**Collector**: One or more samples can be concatenated into UDP messages that go from the Agent to a
central _sFlow Collector_. The heavy lifting in analysis is done upstream from the switch or router,
which is great for performance. Many thousands or even tens of thousands of agents can forward
their samples and interface counters to a single central collector, which in turn can be used to
draw up a near real time picture of the state of traffic through even the largest of ISP networks or
datacenter switch fabrics.

In sFlow parlance, [[VPP](https://fd.io/)] and its companion `hsflowd` together form an _Agent_ (it
sends the UDP packets over the network), and for example the commandline tool `sflowtool` could be a
_Collector_ (it receives the UDP packets).

## Recap: sFlow in VPP

First, I have some pretty good news to report - our work on this plugin was
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
few weeks! Last weekend, I gave a lightning talk at
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
and caught up with a lot of community members and network- and software engineers. I had a great
time.

By keeping the amount of code that runs in the dataplane low, we get both high performance, and a
smaller probability of bugs causing harm. And I do like simple implementations, as they tend to
cause fewer _SIGSEGVs_. The architecture of the end to end solution consists of three distinct
parts, each with their own risk and performance profile:

{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}

**sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
dataplane is congested. Bounded FIFOs also provide fairness: each VPP worker thread gets its fair
share of samples into the Agent's hands.

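To illustrate the tail-drop idea - this is just a toy model in Python, not the plugin's actual C
code - a bounded per-worker FIFO with a drop counter could look like this:

```python
from collections import deque

class BoundedSampleFifo:
    """Toy per-worker sample queue: append at the tail, drop when full."""

    def __init__(self, depth: int):
        self.depth = depth
        self.queue = deque()
        self.drops = 0          # incremented when the dataplane outruns the consumer

    def push(self, sample: bytes) -> bool:
        if len(self.queue) >= self.depth:
            self.drops += 1     # tail-drop: cheap for the worker, visible to the operator
            return False
        self.queue.append(sample)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None

# Each VPP worker owns its own queue, so one very busy worker can only fill
# its own FIFO and never crowds out the samples of the other workers.
fifo = BoundedSampleFifo(depth=1024)
fifo.push(b"\x00" * 64)
print(len(fifo.queue), fifo.drops)
```
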
**sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which we have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.

**host-sflow**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy:
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and
their `ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages
alongside the PSAMPLEs.

By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber queue, and nothing precludes one userspace process (VPP) from being the producer
while another (hsflowd) is the consumer!

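As an aside: on a plain Linux box, the usual kernel-side producer for PSAMPLE is the `tc` `sample`
action. A sketch of what that would look like (the interface name is just an example) is below; it
feeds 1-in-1000 received packets into PSAMPLE group 1, which `hsflowd` consumes in exactly the same
way it consumes VPP's samples. With VPP, the dataplane plugin simply takes the place of this kernel
sampler:

```
$ sudo tc qdisc add dev eno1 clsact
$ sudo tc filter add dev eno1 ingress matchall action sample rate 1000 group 1
```
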
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!

### VPP: sFlow Configuration

The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

I can configure VPP in three ways:

***1. VPP Configuration via CLI***

```
pim@vpp0-0:~$ vppctl
vpp0-0# sflow sampling-rate 100
vpp0-0# sflow polling-interval 10
vpp0-0# sflow header-bytes 128
vpp0-0# sflow enable GigabitEthernet10/0/0
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
vpp0-0# sflow enable GigabitEthernet10/0/2
vpp0-0# sflow enable GigabitEthernet10/0/3
```

The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high frequency. A production setup may take 1:<linkspeed-in-megabits> so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate. The second command sets the interface stats polling interval. The default is to gather
these statistics every 20 seconds, but I set it to 10s here. Then, I instruct the plugin how many
bytes of the sampled ethernet frame should be taken. Common values are 64 and 128. I want enough
data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but
the contents of the payload are rarely interesting for statistics purposes. Finally, I can turn on
the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP, an idiomatic way to
turn on and off things is to have an enabler/disabler. It feels a bit clunky maybe to write `sflow
enable $iface disable` but it makes more logical sense if you parse that as "enable-disable" with
the default being the "enable" operation, and the alternate being the "disable" operation.

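Putting that 1:<linkspeed-in-megabits> rule of thumb into a tiny helper (a sketch of my own, not
part of the plugin):

```python
def recommended_sampling_rate(link_speed_mbps: int) -> int:
    """Rule of thumb from above: sample 1 in <linkspeed-in-megabits> packets."""
    return max(1, link_speed_mbps)

# The plugin's sampling-rate is a global default, so on a box with mixed link
# speeds you would pick a value appropriate for the interfaces you enable.
for speed in (1_000, 10_000, 100_000):
    print(f"{speed // 1000:>3} Gbps -> sflow sampling-rate {recommended_sampling_rate(speed)}")
```
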
***2. VPP Configuration via API***

I wrote a few API calls for the most common operations. Here's a snippet that shows the same calls
as from the CLI above, but in Python APIs:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
import sys

vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-api-client")
print(vpp.api.show_version().version)
# Output: 25.06-rc0~14-g9b1c16039

vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)

vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)

vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
#           sflow_interface_details(_0=667, context=8, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=2))
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
# Output: []

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
```

This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
the current value. Then I set the polling interval to 10s and retrieve the current value again.
Finally, I set the header bytes to 128, and retrieve the value again.

Adding and removing interfaces shows the idiom I mentioned before - the API being an
`enable_disable()` call of sorts, and typically taking a flag if the operator wants to enable (the
default), or disable sFlow on the interface. Getting the list of enabled interfaces can be done with
the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details` messages.

I wrote an article showing the Python API and how it works in a fair amount of detail in a
[[previous article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.

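For completeness, here's how those calls might be wrapped into one idempotent helper. This is a
sketch on top of the calls shown above; the function name and its defaults are my own and not part
of the plugin's API:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles

def configure_sflow(vpp, hw_if_indices, sampling_n=100, polling_s=10, header_b=128):
    """Set the global sFlow knobs and enable the plugin on a list of hw_if_index values."""
    vpp.api.sflow_sampling_rate_set(sampling_N=sampling_n)
    vpp.api.sflow_polling_interval_set(polling_S=polling_s)
    vpp.api.sflow_header_bytes_set(header_B=header_b)

    already_enabled = {i.hw_if_index for i in vpp.api.sflow_interface_dump()}
    for hw_if_index in hw_if_indices:
        if hw_if_index not in already_enabled:
            vpp.api.sflow_enable_disable(hw_if_index=hw_if_index, enable_disable=True)

if __name__ == "__main__":
    api_dir = VPPApiJSONFiles.find_api_dir([])
    api_files = VPPApiJSONFiles.find_api_files(api_dir=api_dir)
    vpp = VPPApiClient(apifiles=api_files, server_address="/run/vpp/api.sock")
    vpp.connect("sflow-config-example")
    configure_sflow(vpp, hw_if_indices=[1, 2])
    print(vpp.api.sflow_interface_dump())
```
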
***3. VPPCfg YAML Configuration***

Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
does not have any form of configuration persistence, and that's on purpose. The project is a
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
integrators. I have written a small Python program that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically. It's called
[[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)], and I wrote some implementation thoughts on
its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:

```
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true

sflow:
  sampling-rate: 100
  polling-interval: 10
  header-bytes: 128
EOF
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```

The slick thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will reflect the newly declared configuration.

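As a sketch of what that change could look like in the YAML (assuming the `sflow` boolean can
simply be flipped to `false` for Gi10/0/0, rather than omitting the key):

```
interfaces:
  GigabitEthernet10/0/0:
    sflow: false
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true

sflow:
  sampling-rate: 1000
  polling-interval: 10
  header-bytes: 128
```
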
### hsflowd: Configuration

VPP will start to emit _Netlink_ messages, of type PSAMPLE with packet samples and of type USERSOCK
with the custom messages about the interface names and counters. These latter custom messages have
to be decoded, which is done by the _mod_vpp_ module, starting from release v2.1.11-5
[[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

Here's a minimalist configuration:

```
pim@vpp0-0:~$ cat /etc/hsflowd.conf
sflow {
  collector { ip=127.0.0.1 udpport=16343 }
  collector { ip=192.0.2.1 namespace=dataplane }
  psample { group=1 }
  vpp { osIndex=off }
}
```

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

There are two important details that can be confusing at first: \
**1.** kernel network namespaces \
**2.** interface index namespaces

#### hsflowd: Network namespace

When started by systemd, `hsflowd` and VPP will normally both run in the _default_ network
namespace. Network namespaces virtualize Linux's network stack. On creation, a network namespace
contains only a loopback interface, and subsequently interfaces can be moved between namespaces.
Each network namespace will have its own set of IP addresses, its own routing table, socket listing,
connection tracking table, firewall, and other network-related resources.

Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
naturally do this in the network namespace that its VPP process is running in (the _default_
namespace by default). It is therefore important that the recipient of all the Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them both in a
different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.

It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.

Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.

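A quick sanity check that may help when debugging this: confirm which namespace each process
actually lives in, and watch where the collector traffic leaves. These are standard iproute2/tcpdump
commands; the namespace name `dataplane` and the process names are from my setup, and `ip netns
identify` prints nothing for a process in the default namespace:

```
pim@summer:~$ sudo ip netns identify $(pidof vpp)
pim@summer:~$ sudo ip netns identify $(pidof hsflowd)
pim@summer:~$ sudo ip netns exec dataplane tcpdump -ni any -c 3 udp port 6343
```
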
#### hsflowd: osIndex

I hope the previous section made some sense, because this one will be a tad more esoteric. When
creating a network namespace, each interface will get its own uint32 interface index that identifies
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
(and will!) occur multiple times, once for each namespace. Let me give you an example:

```
pim@summer:~$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
    link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
    link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1

pim@summer:~$ ip netns exec dataplane ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
    link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
    link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
```

I want to draw your attention to the number at the beginning of the line. In the _default_
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
the _dataplane_ namespace, that index corresponds to a completely different interface called
`ifName=xe1-0` (which is link `UP`).

Now, let me show you the interfaces in VPP:

```
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)
GigabitEthernet4/0/0              1      up          9000/0/0/0
GigabitEthernet4/0/1              2     down         9000/0/0/0
GigabitEthernet4/0/2              3     down         9000/0/0/0
GigabitEthernet4/0/3              4     down         9000/0/0/0
TenGigabitEthernet5/0/0           5      up          9216/0/0/0
TenGigabitEthernet5/0/1           6      up          9216/0/0/0
loop0                             7      up          9216/0/0/0
tap0                              19     up          9216/0/0/0
```

Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to the hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!

So there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default`, each with overlapping numbering
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
   first attaches to or creates fundamental network interfaces like from DPDK or RDMA, these will
   create an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. Any interface (including hardware ones!) will also
   receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
   GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
   software index (in this example, `sw_if_index=7`).

In Linux CP, I can map one onto the other; look at this:

```
pim@summer:~$ vppctl show lcp
lcp default netns dataplane
lcp lcp-auto-subint off
lcp lcp-sync on
lcp lcp-sync-unnumbered on
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
```

Those `itf-pair` describe our _LIPs_, and they hold the coordinates of three things. 1) The VPP
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).

> So the tuple that fully describes a _LIP_ is `{7, 19, 'dataplane', 2}`

Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
VPP takes its sample, it will be doing this on a PHY, that is, a given interface with a specific
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
choice: should it share with the world the representation of *its* namespace, or should it try to be
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, look up a _LIP_
with it. If it finds one, it'll know both the namespace in which it lives as well as the osIndex in
that namespace. If it doesn't, it will at least have the _sw_if_index_ at hand, so it'll annotate the
USERSOCK with this information.

Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use Linux ifIndex) or `off` (use VPP sw_if_index).

### hsflowd: Host Counters

Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
but I don't turn on any interfaces yet in VPP. Once I start the daemon, I can see that it
sends a UDP packet every 30 seconds to the configured _collector_:

```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
    IPv4 agent 198.19.5.16, agent-id 100000, length 716
```

The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in
[[sflowtool](https://github.com/sflow/sflowtool)].

I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
it'll tell me what it finds:

```
pim@vpp0-0:~$ sflowtool -p 6343
startDatagram =================================
datagramSourceIP 127.0.0.1
datagramSize 716
unixSecondsUTC 1739112018
localtime 2025-02-09T15:40:18+0100
datagramVersion 5
agentSubId 100000
agent 198.19.5.16
packetSequenceNo 57
sysUpTime 987398
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 33
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 2
adaptor_0_MACs 1
adaptor_0_MAC_0 525400f00100
counterBlock_tag 0:2010
udpInDatagrams 123904
udpNoPorts 23132459
udpInErrors 0
udpOutDatagrams 46480629
udpRcvbufErrors 0
udpSndbufErrors 0
udpInCsumErrors 0
counterBlock_tag 0:2009
tcpRtoAlgorithm 1
tcpRtoMin 200
tcpRtoMax 120000
tcpMaxConn 4294967295
tcpActiveOpens 0
tcpPassiveOpens 30
tcpAttemptFails 0
tcpEstabResets 0
tcpCurrEstab 1
tcpInSegs 89120
tcpOutSegs 86961
tcpRetransSegs 59
tcpInErrs 0
tcpOutRsts 4
tcpInCsumErrors 0
counterBlock_tag 0:2008
icmpInMsgs 23129314
icmpInErrors 32
icmpInDestUnreachs 0
icmpInTimeExcds 23129282
icmpInParamProbs 0
icmpInSrcQuenchs 0
icmpInRedirects 0
icmpInEchos 0
icmpInEchoReps 32
icmpInTimestamps 0
icmpInAddrMasks 0
icmpInAddrMaskReps 0
icmpOutMsgs 0
icmpOutErrors 0
icmpOutDestUnreachs 23132467
icmpOutTimeExcds 0
icmpOutParamProbs 23132467
icmpOutSrcQuenchs 0
icmpOutRedirects 0
icmpOutEchos 0
icmpOutEchoReps 0
icmpOutTimestamps 0
icmpOutTimestampReps 0
icmpOutAddrMasks 0
icmpOutAddrMaskReps 0
counterBlock_tag 0:2007
ipForwarding 2
ipDefaultTTL 64
ipInReceives 46590552
ipInHdrErrors 0
ipInAddrErrors 0
ipForwDatagrams 0
ipInUnknownProtos 0
ipInDiscards 0
ipInDelivers 46402357
ipOutRequests 69613096
ipOutDiscards 0
ipOutNoRoutes 80
ipReasmTimeout 0
ipReasmReqds 0
ipReasmOKs 0
ipReasmFails 0
ipFragOKs 0
ipFragFails 0
ipFragCreates 0
counterBlock_tag 0:2005
disk_total 6253608960
disk_free 2719039488
disk_partition_max_used 56.52
disk_reads 11512
disk_bytes_read 626214912
disk_read_time 48469
disk_writes 1058955
disk_bytes_written 8924332032
disk_write_time 7954804
counterBlock_tag 0:2004
mem_total 8326963200
mem_free 5063872512
mem_shared 0
mem_buffers 86425600
mem_cached 827752448
swap_total 0
swap_free 0
page_in 306365
page_out 4357584
swap_in 0
swap_out 0
counterBlock_tag 0:2003
cpu_load_one 0.030
cpu_load_five 0.050
cpu_load_fifteen 0.040
cpu_proc_run 1
cpu_proc_total 138
cpu_num 2
cpu_speed 1699
cpu_uptime 1699306
cpu_user 64269210
cpu_nice 1810
cpu_system 34690140
cpu_idle 3234293560
cpu_wio 3568580
cpuintr 0
cpu_sintr 5687680
cpuinterrupts 1596621688
cpu_contexts 3246142972
cpu_steal 329520
cpu_guest 0
cpu_guest_nice 0
counterBlock_tag 0:2006
nio_bytes_in 250283
nio_pkts_in 2931
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 370244
nio_pkts_out 1640
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname vpp0-0
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
machine_type 3
os_name 2
os_release 6.1.0-26-amd64
endSample ----------------------
endDatagram =================================
```

If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand and I
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There's a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!

### hsflowd: Interface Counters

Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed,
every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Some of the packets contain only one
counter record, while others contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have four counters.

Those interface counters, as decoded by `sflowtool`, look like this:

```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
              awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 745
sourceId 0:3
counterBlock_tag 0:1005
ifName GigabitEthernet10/0/2
counterBlock_tag 0:1
ifIndex 3
networkType 6
ifSpeed 0
ifDirection 1
ifStatus 3
ifInOctets 858282015
ifInUcastPkts 780540
ifInMulticastPkts 0
ifInBroadcastPkts 0
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 0
ifOutOctets 1246716016
ifOutUcastPkts 975772
ifOutMulticastPkts 0
ifOutBroadcastPkts 0
ifOutDiscards 127
ifOutErrors 28
ifPromiscuousMode 0
endSample ----------------------
```

What I find particularly cool about it, is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005), together with an object (tag 0:1) which contains the
`ifIndex=3`, and lots of packet and octet counters both in the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!

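Since `sflowtool`'s output is line-oriented, it's also very easy to post-process. As a small sketch
of my own, this filter rebuilds the ifIndex-to-ifName mapping from a counter dump like the one
above (pipe `sflowtool -r sflow-interface.pcap` into it):

```python
import sys

def iface_map(lines):
    """Collect {ifIndex: ifName} pairs from sflowtool COUNTERSSAMPLE output."""
    mapping, name = {}, None
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        key, value = fields
        if key == "ifName":
            name = value
        elif key == "ifIndex" and name is not None:
            mapping[int(value)] = name
        elif key == "endSample":
            name = None          # reset between samples
    return mapping

if __name__ == "__main__":
    for ifindex, ifname in sorted(iface_map(sys.stdin).items()):
        print(f"{ifindex:>3} {ifname}")
```
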
#### hsflowd: Packet Samples

Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that it
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
now see a healthy stream of sFlow packets coming in on port 6343. They still contain every 30
seconds or so a host counter, and every 10 seconds a set of interface counters come by, but mostly
these UDP packets are showing me samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:

{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}

Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows the sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my `iperf3`,
booyah!

### VPP: sFlow Performance

{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}

One question I get a lot about this plugin is: what is the performance impact when using
sFlow? I spent a considerable amount of time tinkering with this, and together with Neil bringing
the plugin to what we both agree is the most efficient use of CPU. We could go a bit further, but
that would require somewhat intrusive changes to VPP's internals and as _North of the Border_ would
say: what we have isn't just good, it's good enough!

I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.

To help you reproduce my results, and under the assumption that this is your jam, here's the
configuration for all of the kit:

***0. Cisco T-Rex***
```
pim@trex:~ $ cat /srv/trex/8x10.yaml
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac:  00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
    - src_mac:  00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
    - src_mac:  00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
    - src_mac:  00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan

    - src_mac:  00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
    - src_mac:  00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
    - src_mac:  9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
    - src_mac:  9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
```

When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On
L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at
all. It puts its interface in _promiscuous_ mode, and simply writes out any ethernet frame received,
directly to the egress interface.

***1. IPv4***
```
hippo# set int state TenGigabitEthernet3/0/0 up
hippo# set int state TenGigabitEthernet3/0/1 up
hippo# set int state TenGigabitEthernet130/0/0 up
hippo# set int state TenGigabitEthernet130/0/1 up
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```

By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.

***2. MPLS***
```
hippo# mpls table add 0
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```

Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which we already know the
MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation.

***3. L2XC***
```
hippo# set int state TenGigabitEthernet3/0/2 up
hippo# set int state TenGigabitEthernet3/0/3 up
hippo# set int state TenGigabitEthernet130/0/2 up
hippo# set int state TenGigabitEthernet130/0/3 up
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```

I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
an L2 (ethernet) datagram, and immediately transmit it on another interface. There's no FIB lookup
and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in-and-out as
fast as it can!

Here's what a loadtest looks like when sending 80Gbps at 256b packets on all eight interfaces:

{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}

The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.

**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps either _with_, or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.

But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics by varying the
CPU time spent and maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.

| Loadtest                    | no sFlow  | 1:1'000'000   | 1:10'000      | 1:1'000       | 1:100         |
|-----------------------------|-----------|---------------|---------------|---------------|---------------|
| L2XC                        | 14.88Mpps | 14.32Mpps     | 14.31Mpps     | 14.27Mpps     | 14.15Mpps     |
| IPv4                        | 10.89Mpps | 9.88Mpps      | 9.88Mpps      | 9.84Mpps      | 9.73Mpps      |
| MPLS                        | 10.11Mpps | 9.52Mpps      | 9.52Mpps      | 9.51Mpps      | 9.45Mpps      |
| ***sFlow Packets*** / 10sec | N/A       | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
| .. Sampled                  |           | 328           | 33.8k         | 336k          | 3.34M         |
| .. Sent                     |           | 328           | 33.8k         | 336k          | 1.53M         |
| .. Dropped                  |           | 0             | 0             | 0             | 1.81M         |

Here I can make a few important observations.

**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which
implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see
that the throughput is CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I
know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The
total capacity is 10.11Mpps when sFlow is turned off.

**Overhead**: If I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
dataplane graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC so it used some CPU cycles which were still available, before regressing
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS.

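To put that per-packet cost into perspective, here's a rough back-of-the-envelope calculation. The
worker clock frequency is an assumption of mine (it isn't stated in the table above), so treat the
percentages as an order-of-magnitude sketch rather than a measurement; the measured regressions also
include effects beyond just moving packets between nodes:

```python
CLOCK_HZ = 2.2e9            # assumed worker core clock, for illustration only
SFLOW_CYCLES_PER_PKT = 9.5  # cost of handing every packet to the next node

for name, pps in [("L2XC", 14.88e6), ("IPv4", 10.89e6), ("MPLS", 10.11e6)]:
    budget = CLOCK_HZ / pps                       # cycles available per packet
    overhead = 100.0 * SFLOW_CYCLES_PER_PKT / budget
    print(f"{name}: ~{budget:.0f} cycles/packet budget, "
          f"sflow-node adds roughly {overhead:.1f}% of that")
```
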
**Sampling Cost**: When then doing higher rates of sampling, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost
can be kept well in hand.

**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they are not fitting through the FIFO, so the plugin is dropping samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending 153K samples/sec to the
collector!

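The arithmetic behind those numbers checks out nicely, as this quick recomputation of the table's
last two columns shows:

```python
total_pps = 33.65e6    # aggregate packet rate across the sFlow-enabled ports
window_s = 10          # the table's sample counters are per 10 seconds

for rate, sent in [(1_000, 336e3), (100, 1.53e6)]:
    expected = total_pps * window_s / rate        # samples the plugin should take
    print(f"1:{rate}: ~{expected / 1e6:.2f}M samples expected in {window_s}s, "
          f"{sent / 1e6:.2f}M made it to PSAMPLE "
          f"(~{sent / window_s / 1e3:.0f}K samples/sec towards the collector)")
```
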
## What's Next

Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
make the user experience even better (or simpler to understand, at least).

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
time for the 25.02 release.