few weeks! Last weekend, I gave a lightning talk in Brussels, Belgium, and caught up with a lot of
community members and network- and software engineers. I had a great time.
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
might impact VPP's dataplane stability low, the architecture of the end-to-end solution consists of
three distinct parts, each with their own risk and performance profile:
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever: we
simply copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a FIFO
queue. If too many samples arrive, samples are dropped at the tail, and a counter incremented. This
way, I can tell if the dataplane is congested. Bounded FIFOs also provide fairness: each VPP worker
thread gets its fair share of samples into the Agent's hands.
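To make the FIFO mechanics concrete, here is a small conceptual model in Python. It only illustrates the idea described above; the class, the depth and the sample format are mine, not the plugin's actual (C) implementation inside VPP.

```python
# Conceptual model of a per-worker bounded FIFO with tail drop. The depth chosen
# here is arbitrary and for illustration only.
from collections import deque

class WorkerSampleFifo:
    def __init__(self, depth: int = 1024):
        self.q = deque()
        self.depth = depth
        self.dropped = 0                   # incremented on every tail drop

    def push(self, if_index: int, header: bytes) -> None:
        if len(self.q) >= self.depth:
            self.dropped += 1              # FIFO full: drop at the tail
            return
        self.q.append((if_index, header))  # inIfIndex + first bytes of the frame

    def drain(self):
        while self.q:
            yield self.q.popleft()         # consumed later by the main process
```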
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
the PSAMPLEs.
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) from being the
producer while another userspace process (`hsflowd`) acts as the consumer!
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!
```
vpp0-0# sflow enable GigabitEthernet10/0/3
```
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate, depending on link load. The second command sets the interface stats polling interval.
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
values are 64 and 128, but it doesn't have to be a power of two. I want enough data to see the
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
the payload are rarely interesting for
statistics purposes.
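As a quick back-of-the-envelope check on why 64 bytes can be tight and 128 is comfortable, here is the arithmetic with the standard fixed header sizes (no IP or TCP options); the exact mix of tags and labels is of course traffic dependent:

```python
# Header sizes in bytes (fixed parts only, no IP/TCP options).
ETH = 14      # Ethernet II
DOT1Q = 4     # per 802.1Q tag
MPLS = 4      # per MPLS label
IPV4 = 20     # IPv4 header
TCP = 20      # TCP header

plain = ETH + IPV4 + TCP                           # 54: fits in 64 bytes
stacked = ETH + 2 * DOT1Q + 2 * MPLS + IPV4 + TCP  # 70: wants the 128 byte option
print(plain, stacked)
```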
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI.
***2. VPP Configuration via API***
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
same config as what I typed on the CLI above, but using these Python API calls:
```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
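
# The rest of the call sequence, sketched out. NOTE: the sFlow method and field
# names below (sampling_N, polling_S, header_B, hw_if_index) are assumptions on
# my side; check the plugin's generated .api.json for the authoritative names.

vpp_json_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_json_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-demo")

# Global defaults: sampling rate, polling interval and header bytes, reading
# back the current value after each set.
vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())

# Turn on sFlow on one interface, looking up its index by name first.
ifaces = {i.interface_name: i.sw_if_index for i in vpp.api.sw_interface_dump()}
vpp.api.sflow_enable_disable(hw_if_index=ifaces["GigabitEthernet10/0/3"],
                             enable_disable=True)

# Which interfaces have sFlow enabled?
for details in vpp.api.sflow_interface_dump():
    print(details)

vpp.disconnect()
```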
First, I set the sampling rate and retrieve the current value. Then I set the polling interval to
10s and retrieve the current value.
Finally, I set the header bytes to 128, and retrieve the value again.
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
messages.
I've written a dedicated [[article]({{< ref 2024-01-27-vpp-papi >}})] about the VPP Python API, in
case this type of stuff interests you.
***3. VPPCfg YAML Configuration***
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
programmable dataplane, and it explicitly leaves the programming and configuration as an exercise for
integrators. I have written a Python project that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically, called
[[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)]. Previously, I wrote some implementation thoughts
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:
```
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
### hsflowd: Configuration
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
and of type USERSOCK with the custom messages containing interface names and counters. These latter
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
Here's a minimalist configuration:
There are two important details that can be confusing at first:
#### hsflowd: Network namespace
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
and VPP will normally both run in the _default_ network namespace.
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
naturally do this in the network namespace that its VPP process is running in (the _default_
namespace, normally). It is therefore important that the recipient of these Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together
in a different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each
other.
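A quick way to double-check this on a running system is to compare the two daemons' network namespaces via `/proc`. The little helper below is mine, not part of VPP or `hsflowd`, and it assumes both processes can be found with `pidof`:

```python
# Compare the network namespace of two processes: /proc/<pid>/ns/net resolves to
# the same (device, inode) pair if and only if they share a network namespace.
import os
import subprocess

def netns_id(pid: int) -> tuple:
    st = os.stat(f"/proc/{pid}/ns/net")
    return (st.st_dev, st.st_ino)

vpp_pid = int(subprocess.check_output(["pidof", "vpp"]).split()[0])
hsflowd_pid = int(subprocess.check_output(["pidof", "hsflowd"]).split()[0])
print("same netns:", netns_id(vpp_pid) == netns_id(hsflowd_pid))
```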
It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
pairs in a separate `dataplane` network namespace.
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to the hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
It turns out that there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
(and overlapping) numbering.
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
each receive an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
software index (in this example, `sw_if_index=7`).
In Linux CP, I can see a mapping from one to the other, just look at this:
```
pim@summer:~$ vppctl show lcp
```
When VPP takes its sample, it will be doing this on a PHY, that is a given interface and
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
choice: should it share with the world the representation of *its* namespace, or should it try to be
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
plugin will first resolve the _sw_if_index_ belonging to that PHY and, using that, try to look up a
_LIP_. If it finds one, it'll know both the namespace in which it lives as well as the
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
hand, so it'll annotate the USERSOCK counter messages with this information instead.
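In other words, the lookup the plugin performs can be modeled like this. The Python below is purely a conceptual sketch of the logic just described; the dictionaries, the `Lip` type and the example values are made up for illustration and are not the plugin's real data structures:

```python
# Conceptual sketch of the PHY -> sw_if_index -> LIP resolution described above.
from typing import NamedTuple, Optional

class Lip(NamedTuple):
    netns: str       # e.g. "dataplane"
    os_index: int    # the Linux ifIndex inside that namespace

def annotate_counters(hw_if_index: int, hw_to_sw: dict, lips: dict) -> dict:
    sw_if_index = hw_to_sw[hw_if_index]      # resolve the PHY's sw_if_index
    lip: Optional[Lip] = lips.get(sw_if_index)
    if lip is not None:                      # a LIP exists: share Linux's view
        return {"netns": lip.netns, "osIndex": lip.os_index,
                "sw_if_index": sw_if_index}
    return {"sw_if_index": sw_if_index}      # no LIP: fall back to VPP's index

# Illustrative values only:
print(annotate_counters(1, {1: 3}, {3: Lip("dataplane", 2)}))
```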
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
### hsflowd: Host Counters
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
it sends a UDP packet every 30 seconds to the configured _collector_:
```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
endSample ----------------------
endDatagram =================================
```
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might I
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!
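If you'd like to peek at these datagrams without `sflowtool`, a few lines of Python can act as a stand-in collector. The header fields decoded below (version, agent address, sequence number, uptime, sample count) follow my reading of the sFlow v5 datagram layout and assume an IPv4 agent address, so treat it as a sketch rather than a full decoder:

```python
# Minimal stand-in collector: listen on UDP 6343 and decode the sFlow v5
# datagram header (assumes an IPv4 agent address, as in the captures above).
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 6343))

while True:
    data, peer = sock.recvfrom(9000)
    version, addr_type = struct.unpack_from("!II", data, 0)
    agent = socket.inet_ntoa(data[8:12]) if addr_type == 1 else "non-IPv4"
    sub_agent, seq, uptime_ms, num_samples = struct.unpack_from("!IIII", data, 12)
    print(f"from {peer}: sFlow v{version} agent={agent} seq={seq} "
          f"uptime={uptime_ms / 1000:.0f}s samples={num_samples}")
```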
Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M, so that effectively no packets get sampled), and the interface
polling-interval to every 10 seconds. And indeed,
every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.
Those interface counters, as decoded by `sflowtool`, look like this:
```
endSample ----------------------
```
What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1), which contains the
`ifIndex=3`, and lots of packet and octet counters both in the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!
Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!
### VPP: sFlow Performance
For the L2XC (cross connect) case, VPP will receive an L2 (ethernet) datagram, and immediately
transmit it on another interface. There is no L3 lookup and not even an L2 nexthop lookup involved;
VPP is just shoveling ethernet packets in-and-out as
fast as it can!
Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:
{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p0 <-> p2 are sending L2XC traffic
back and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p3 <-> p4 and p5 <-> p6
respectively have sFlow turned off but with the same configuration. They are my control, showing
the CPU use without sFlow.
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
Here I can make a few important observations.
**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which
implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see
that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU
thread). I know that MPLS is a little bit more expensive computationally than IPv4, and that checks
out. The total MPLS capacity is 10.11Mpps for one worker, when sFlow is turned off.
**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. This means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC test, so it used some CPU cycles which were still available before regressing
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the
packets through the graph.
**Sampling Cost**: When sampling at higher rates, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
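As a quick sanity check on that worst-case L2XC number, using only the throughputs quoted above:

```python
# L2XC at 1:100 sampling, from the figures quoted above.
baseline_mpps = 14.32   # sFlow enabled, before heavy sampling
sampled_mpps = 14.15    # sFlow enabled, sampling at 1:100
regression = (baseline_mpps - sampled_mpps) / baseline_mpps
print(f"{regression:.1%}")  # ~1.2%
```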
Even at the highest sampling rates, the cost can be kept well in hand.
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
they are not fitting through the FIFO, so the plugin is dropping samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
to the collector!
## What's Next