---
date: "2023-04-09T11:01:14Z"
title: VPP - Monitoring
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
can read all about in my series on VPP back in 2021:

[DENOG14 talk (video thumbnail)](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)

* [[Part 1]({% post_url 2021-08-12-vpp-1 %})]: Punting traffic through TUN/TAP interfaces into Linux
* [[Part 2]({% post_url 2021-08-13-vpp-2 %})]: Mirroring VPP interface configuration into Linux
* [[Part 3]({% post_url 2021-08-15-vpp-3 %})]: Automatically creating sub-interfaces in Linux
* [[Part 4]({% post_url 2021-08-25-vpp-4 %})]: Synchronize link state, MTU and addresses to Linux
* [[Part 5]({% post_url 2021-09-02-vpp-5 %})]: Netlink Listener, synchronizing state from Linux to VPP
* [[Part 6]({% post_url 2021-09-10-vpp-6 %})]: Observability with LibreNMS and VPP SNMP Agent
* [[Part 7]({% post_url 2021-09-21-vpp-7 %})]: Productionizing and reference Supermicro fleet at IPng

With this, I can make a regular server running Linux use VPP as kind of a software ASIC for super
fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links,
addresses and routes) itself. With Linux CP, running software like [FRR](https://frrouting.org/) or
[Bird](https://bird.network.cz/) on top of VPP and achieving >150Mpps and >180Gbps forwarding
rates is easily within reach. If you find that hard to believe, check out [[my DENOG14
talk](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] or click the thumbnail above. I am
continuously surprised at the performance per watt, and the performance per Swiss Franc spent.

## Monitoring VPP

Of course, it's important to be able to see what routers are _doing_ in production. For the longest
time, the _de facto_ standard for monitoring in the networking industry has been the Simple Network
Management Protocol (SNMP), described in [[RFC 1157](https://www.rfc-editor.org/rfc/rfc1157)]. But
there's another way, using a metrics and time series system called _Borgmon_, originally designed by
Google [[ref](https://sre.google/sre-book/practical-alerting/)] but popularized by SoundCloud in an
open source interpretation called **Prometheus** [[ref](https://prometheus.io/)]. IPng Networks ♥ Prometheus.

I'm a really huge fan of Prometheus and its graphical frontend Grafana, as you can see with my work on
Mastodon in [[this article]({% post_url 2022-11-27-mastodon-3 %})]. Join me on
[[ublog.tech](https://ublog.tech)] if you haven't joined the Fediverse yet. It's well monitored!

### SNMP

SNMP defines an extensible model by which parts of the OID (object identifier) tree can be delegated
to another process, and the main SNMP daemon will call out to it using an _AgentX_ protocol,
described in [[RFC 2741](https://datatracker.ietf.org/doc/html/rfc2741)]. In a nutshell, this
allows an external program to connect to the main SNMP daemon, register an interest in certain OIDs,
and get called whenever the SNMPd is being queried for them.

{{< image width="400px" float="right" src="/assets/vpp-stats/librenms.png" alt="LibreNMS" >}}

The flow is pretty simple (see section 6.2 of the RFC); the Agent (client):
1. opens a TCP or Unix domain socket to the SNMPd
1. sends an Open PDU, which the server will either acknowledge or reject
1. (optionally) sends a Ping PDU, to which the server will respond
1. registers an interest in OIDs with a Register PDU

It then waits and gets called by the SNMPd with Get PDUs (to retrieve one single value), GetNext PDUs
(to enable snmpwalk), and GetBulk PDUs (to retrieve a whole subsection of the MIB), all of which are
answered with a Response PDU.

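The VPP SNMP Agent I describe below is written in Python, but just to illustrate how little is
needed to speak AgentX: with net-snmp's agent library, a C subagent boils down to something like
the following minimal sketch. The subagent name `vpp-subagent` and the OID registrations are
placeholders, not code from my agent:

```
/* Minimal AgentX subagent sketch using net-snmp's agent library; the
 * name "vpp-subagent" and the OID handlers are placeholders. */
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <net-snmp/agent/net-snmp-agent-includes.h>

int main (void) {
  /* Run as an AgentX subagent rather than as a master agent */
  netsnmp_ds_set_boolean (NETSNMP_DS_APPLICATION_ID, NETSNMP_DS_AGENT_ROLE, 1);
  init_agent ("vpp-subagent");

  /* ... register handlers here for the OIDs we are interested in,
   *     e.g. ifTable/ifXTable rows for VPP's dataplane interfaces ... */

  init_snmp ("vpp-subagent");  /* opens the AgentX session and registers */

  while (1)
    agent_check_and_process (1); /* block, then answer Get/GetNext/GetBulk */

  /* not reached */
  return 0;
}
```
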
Using parts of a Python AgentX library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hand at writing one of these AgentX agents.
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
one that has been running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
pretty graphs, as well as detect when interfaces go down, or are overloaded, and so on. That's
pretty slick.

### VPP Stats Segment in Go

But if I may offer some critique on my own approach, SNMP monitoring is _very_ 1990s. I'm
continuously surprised that our industry is still clinging to this archaic approach. VPP offers
_a lot_ of observability: its statistics segment is chock-full of interesting counters and gauges
that can be really helpful to understand how the dataplane performs. If there are errors or a
bottleneck develops in the router, going over `show runtime` or `show errors` can be a lifesaver.
Let's take another look at that Stats Segment (the one that the SNMP AgentX connects to in order to
query it for packets/byte counters and interface names).

You can think of the Stats Segment as a directory hierarchy where each file represents a type of
counter. VPP comes with a small helper tool called VPP Stats FS, which uses a FUSE based read-only
filesystem to expose those counters in an intuitive way, so let's take a look:

```
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo systemctl start vpp
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make start
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ mount | grep stats
rawBridge on /run/vpp/stats_fs_dir type fuse.rawBridge (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

pim@hippo:/run/vpp/stats_fs_dir$ ls -la
drwxr-xr-x 0 root root 0 Apr  9 14:07 bfd
drwxr-xr-x 0 root root 0 Apr  9 14:07 buffer-pools
drwxr-xr-x 0 root root 0 Apr  9 14:07 err
drwxr-xr-x 0 root root 0 Apr  9 14:07 if
drwxr-xr-x 0 root root 0 Apr  9 14:07 interfaces
drwxr-xr-x 0 root root 0 Apr  9 14:07 mem
drwxr-xr-x 0 root root 0 Apr  9 14:07 net
drwxr-xr-x 0 root root 0 Apr  9 14:07 node
drwxr-xr-x 0 root root 0 Apr  9 14:07 nodes
drwxr-xr-x 0 root root 0 Apr  9 14:07 sys

pim@hippo:/run/vpp/stats_fs_dir$ cat sys/boottime
1681042046.00
pim@hippo:/run/vpp/stats_fs_dir$ date +%s
1681042058
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make stop
```

There's lots of really interesting stuff in here - for example in the `/sys` hierarchy we can see a
`boottime` file, and from there I can determine the uptime of the process. Further, the `/mem`
hierarchy shows the current memory usage for each of the _main_, _api_ and _stats_ segment heaps.
And of course, in the `/interfaces` hierarchy we can see all the usual packets and bytes counters
for any interface created in the dataplane.

### VPP Stats Segment in C

I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick, as I've already noticed in the Python3 based
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c`, and it reveals the following pattern
(a minimal sketch in code follows the list):

1. assemble a vector of regular expression patterns in the hierarchy, or just `^/` to start
1. get a handle to the stats segment with `stat_segment_ls()` using the pattern(s)
1. use the handle to dump the stats segment into a vector with `stat_segment_dump()`
1. iterate over the returned stats structure; each element has a type and a given name:
   * ***STAT_DIR_TYPE_SCALAR_INDEX***: these are floating point doubles
   * ***STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE***: a single uint64 counter
   * ***STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED***: two uint64 counters (packets and bytes)
1. free the used stats structure with `stat_segment_data_free()`

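Here's a minimal sketch of that pattern, loosely following `vpp_get_stats.c`. Exact header
locations and helper names (such as `stat_segment_string_vector()`) can differ between VPP
releases, so treat it as a sketch rather than a drop-in program:

```
/* Sketch only: connect to the stats segment, list everything under "^/",
 * dump it, and walk the result. Headers/helpers may differ per release. */
#include <vpp-api/client/stat_client.h>
#include <vppinfra/vec.h>
#include <stdio.h>

int main (void) {
  u8 **patterns = 0;
  patterns = stat_segment_string_vector (patterns, "^/");   /* 1. patterns */

  if (stat_segment_connect ("/run/vpp/stats.sock") < 0)     /* default socket */
    return 1;

  u32 *dir = stat_segment_ls (patterns);                    /* 2. handle */
  stat_segment_data_t *res = stat_segment_dump (dir);       /* 3. dump */

  for (int i = 0; i < vec_len (res); i++)                   /* 4. iterate */
    switch (res[i].type) {
    case STAT_DIR_TYPE_SCALAR_INDEX:
      printf ("%.2f %s\n", res[i].scalar_value, res[i].name);
      break;
    case STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE:
      /* simple_counter_vec[thread][index], decoded below */
      break;
    case STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED:
      /* combined_counter_vec[thread][index].packets and .bytes */
      break;
    default:
      break;
    }

  stat_segment_data_free (res);                             /* 5. free */
  stat_segment_disconnect ();
  return 0;
}
```
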
The simple and combined stats turn out to be associative arrays, the outer of which notes the
_thread_ and the inner of which refers to the _index_. As such, a statistic of type
***VECTOR_SIMPLE*** can be decoded like so:

```
if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE)
  for (k = 0; k < vec_len (res[i].simple_counter_vec); k++)
    for (j = 0; j < vec_len (res[i].simple_counter_vec[k]); j++)
      printf ("[%d @ %d]: %llu packets %s\n", j, k, res[i].simple_counter_vec[k][j], res[i].name);
```

The statistic of type ***VECTOR_COMBINED*** is very similar, except the union type there is a
`combined_counter_vec[k][j]` which has a member `.packets` and a member called `.bytes`. The
simplest form, ***SCALAR_INDEX***, is just a single floating point number attached to the name.

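For completeness, and assuming the same `res`, `i`, `j` and `k` variables as the loop above, a
combined counter can be walked the same way:

```
if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED)
  for (k = 0; k < vec_len (res[i].combined_counter_vec); k++)
    for (j = 0; j < vec_len (res[i].combined_counter_vec[k]); j++)
      printf ("[%d @ %d]: %llu packets, %llu bytes %s\n", j, k,
              res[i].combined_counter_vec[k][j].packets,
              res[i].combined_counter_vec[k][j].bytes, res[i].name);
```
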
In principle, this should be really easy to sift through and decode. Now that I've figured that
out, let me dump a bunch of stats with the `vpp_get_stats` tool that comes with vanilla VPP:

```
pim@chrma0:~$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep -v ': 0'
[0 @ 2]: 67057 packets /interfaces/TenGigabitEthernet81_0_0.40121/drops
[0 @ 2]: 76125287 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip4
[0 @ 2]: 1793946 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip6
[0 @ 2]: 77919629 packets, 66184628769 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 0]: 7 packets, 610 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 1]: 26687 packets, 18771919 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 2]: 6448944 packets, 3663975508 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 3]: 138924 packets, 20599785 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 4]: 130720342 packets, 57436383614 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
```

I can see both types of counter at play here. Let me explain the first line: it is saying that the
counter named `/interfaces/TenGigabitEthernet81_0_0.40121/drops`, at counter index 0 on CPU thread
2, has a simple counter with value 67057. Taking the last line, this is a combined counter with
name `/interfaces/TenGigabitEthernet81_0_0.40121/tx` at index 0; all five CPU threads (the main
thread and four worker threads) have sent traffic into this interface, and the counters for each
are given in packets and bytes.

For readability's sake, my `grep -v` above doesn't print any counter that is 0. For example,
interface `Te81/0/0` has only one receive queue, and it's bound to thread 2. The other threads will
not receive any packets for it; consequently, their `rx` counters stay zero:

```
pim@chrma0:~/src/vpp$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep rx$
[0 @ 0]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 1]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 2]: 80720186 packets, 68458816253 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 3]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 4]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
```

### Hierarchy: Pattern Matching

I quickly discover a pattern in most of these names: they start with a scope, say `/interfaces`,
then have a path entry for the interface name, and finally a specific counter (`/rx` or `/mpls`).
This is true also for the `/nodes` hierarchy, in which all VPP's graph nodes have a set of counters:

```
pim@chrma0:~$ vpp_get_stats dump /nodes/ip4-lookup | grep -v ': 0'
[0 @ 1]: 11365675493301 packets /nodes/ip4-lookup/clocks
[0 @ 2]: 3256664129799 packets /nodes/ip4-lookup/clocks
[0 @ 3]: 28364098623954 packets /nodes/ip4-lookup/clocks
[0 @ 4]: 30198798628761 packets /nodes/ip4-lookup/clocks
[0 @ 1]: 80870763789 packets /nodes/ip4-lookup/vectors
[0 @ 2]: 17392446654 packets /nodes/ip4-lookup/vectors
[0 @ 3]: 259363625369 packets /nodes/ip4-lookup/vectors
[0 @ 4]: 298176625181 packets /nodes/ip4-lookup/vectors
[0 @ 1]: 49730112811 packets /nodes/ip4-lookup/calls
[0 @ 2]: 13035172295 packets /nodes/ip4-lookup/calls
[0 @ 3]: 109088424231 packets /nodes/ip4-lookup/calls
[0 @ 4]: 119789874274 packets /nodes/ip4-lookup/calls
```

If you've ever seen the output of `show runtime`, it looks like this:

```
vpp# show runtime
Thread 1 vpp_wk_0 (lcore 28)
Time 3377500.2, 10 sec internal node vector rate 1.46 loops/sec 3301017.05
  vector rates in 2.7440e6, out 2.7210e6, drop 3.6025e1, punt 7.2243e-5
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
...
ip4-lookup                       active     49732141978     80873724903               0          1.41e2            1.63
```

Hey look! On thread 1, which is called `vpp_wk_0` and is running on logical CPU core #28, there are
a bunch of VPP graph nodes that are all keeping stats of what they've been doing, and you can see
here that the following numbers line up between `show runtime` and the VPP Stats dumper:

* ***Name***: This is the name of the VPP graph node, in this case `ip4-lookup`, which is performing an
  IPv4 FIB lookup to figure out what the L3 nexthop is of a given IPv4 packet we're trying to route.
* ***Calls***: How often did we invoke this graph node, 49.7 billion times so far.
* ***Vectors***: How many packets did we push through, 80.87 billion, humble brag.
* ***Clocks***: This one is a bit different -- you can see the cumulative clock cycles spent by
  this CPU thread in the stats dump: 11365675493301 divided by 80870763789 packets is 140.54 CPU
  cycles per packet. It's a cool interview question: "How many CPU cycles does it take to do an
  IPv4 routing table lookup?" You now know the answer :-)
* ***Vectors/Call***: This is a measure of how busy the node is (did it run for only one packet,
  or for many packets?). On average, when the worker thread gave the `ip4-lookup` node some work to
  do, there have been a total of 80873724903 packets handled in 49732141978 calls, so 1.626
  packets per call. If ever you're handling 256 packets per call (the most VPP will allow per call),
  your router will be sobbing.

### Prometheus Metrics

Prometheus has metrics which carry a name, and zero or more labels. The Prometheus query language
can then use these labels to do aggregation, division, averages, and so on. As a practical example,
above I looked at interface stats and saw that the Rx/Tx numbers were counted one per thread. If
we'd like the total on the interface, it would be great if we could `sum without (thread,index)`,
which will have the effect of adding all of these numbers together.

For the monotonically increasing counter numbers (like the total vectors/calls/clocks per node), we
can take the running _rate_ of change, showing the time spent over the last minute or so. This way,
spikes in traffic will clearly correlate with a spike in packets/sec or bytes/sec on the
interface, but also with a higher number of _vectors/call_, and correspondingly typically a lower number
of _clocks/vector_, as VPP gets more efficient when it can re-use the CPU's instruction and data
cache to do repeat work on multiple packets.

I decide to massage the statistic names a little bit, by transforming them into the basic format:
`prefix_suffix{label="X",index="A",thread="B"} value`

A few examples:
* The single counter that looks like `[6 @ 0]: 994403888 packets /mem/main heap` becomes:
  * `mem{heap="main heap",index="6",thread="0"} 994403888`
* The combined counter `[0 @ 1]: 79582338270 packets, 16265349667188 bytes /interfaces/Te1_0_2/rx`
  becomes:
  * `interfaces_rx_packets{interface="Te1_0_2",index="0",thread="1"} 79582338270`
  * `interfaces_rx_bytes{interface="Te1_0_2",index="0",thread="1"} 16265349667188`
* The node information running on, say thread 4, becomes:
  * `nodes_clocks{node="ip4-lookup",index="0",thread="4"} 30198798628761`
  * `nodes_vectors{node="ip4-lookup",index="0",thread="4"} 298176625181`
  * `nodes_calls{node="ip4-lookup",index="0",thread="4"} 119789874274`
  * `nodes_suspends{node="ip4-lookup",index="0",thread="4"} 0`

### VPP Exporter

I wish I had things like `split()` and `re.match()` in C (well, I guess I do have POSIX regular
expressions...), but it's all a little bit more low-level. Based on my basic loop that opens the
stats segment, registers its desired patterns, and then retrieves a vector of {name, type,
counter}-tuples, I decide to do a little bit of non-intrusive string tokenization first:

```
static int tokenize (const char *str, char delimiter, char **tokens, int *lengths) {
  char *p = (char *) str;
  char *savep = p;
  int i = 0;

  /* Walk the string once; every time we hit the delimiter, emit the token
   * that started at 'savep' as a {ptr, len}-tuple. */
  while (*p) {
    if (*p == delimiter) {
      tokens[i] = (char *) savep;
      lengths[i] = (int) (p - savep);
      i++;
      p++;
      savep = p;
    } else {
      p++;
    }
  }
  /* Whatever follows the last delimiter is the final token. */
  tokens[i] = (char *) savep;
  lengths[i] = (int) (p - savep);
  return i + 1; /* number of tokens found */
}

/* The call site */
char *tokens[10];
int lengths[10];
int num_tokens = tokenize (res[i].name, '/', tokens, lengths);
```

The tokenizer takes an array of N pointers to the resulting tokens, and their lengths. This sets it
apart from `strtok()` and friends, because those will overwrite the occurrences of the delimiter in
the input string with `\0`, and as such cannot take a `const char *str` as input. This one leaves
the string alone though, and will return the tokens as {ptr, len}-tuples, including how many
tokens it found.

One thing I'll probably regret is that there's no bounds checking on the number of tokens -- if I
have more than 10 of these, I'll come to regret it. But for now, the depth of the hierarchy is only
3, so I should be fine. Besides, I got into a fight with ChatGPT after it declared a romantic
interest in my cat, so it won't write code for me anymore :-(

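If that ever does bite me, a bounded variant is a small change. Here's one possible (hypothetical)
shape, which simply stops splitting once the arrays are full:

```
/* Hypothetical bounded variant: never writes past max_tokens entries;
 * the remainder of the string ends up in the final token. */
static int tokenize_bounded (const char *str, char delimiter,
                             char **tokens, int *lengths, int max_tokens) {
  const char *p = str, *savep = str;
  int i = 0;

  while (*p) {
    if (*p == delimiter && i < max_tokens - 1) {
      tokens[i] = (char *) savep;
      lengths[i] = (int) (p - savep);
      i++;
      savep = p + 1;
    }
    p++;
  }
  tokens[i] = (char *) savep;
  lengths[i] = (int) (p - savep);
  return i + 1; /* number of tokens found */
}
```
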
But using this simple tokenizer, and knowledge of the structure of well-known hierarchy paths, the
rest of the exporter is quickly in hand. Some variables don't have a label (for example
`/sys/boottime`), but those that do will see that field transposed from the directory path
`/mem/main heap/free` into the label as I showed above.

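As an illustration of that transposition (not the literal exporter code), an `/interfaces/.../rx`
combined counter could be turned into the exposition format from the previous section like this,
reusing the `tokens`, `lengths` and counter variables from above:

```
/* Sketch: "/interfaces/Te1_0_2/rx" tokenizes into ["", "interfaces",
 * "Te1_0_2", "rx"]; tokens[2] becomes the interface label, tokens[3] the
 * metric suffix. j is the counter index, k the thread, as before. */
if (num_tokens == 4 && lengths[1] == (int) strlen ("interfaces") &&
    strncmp (tokens[1], "interfaces", lengths[1]) == 0) {
  printf ("interfaces_%.*s_packets{interface=\"%.*s\",index=\"%d\",thread=\"%d\"} %llu\n",
          lengths[3], tokens[3], lengths[2], tokens[2], j, k,
          res[i].combined_counter_vec[k][j].packets);
  printf ("interfaces_%.*s_bytes{interface=\"%.*s\",index=\"%d\",thread=\"%d\"} %llu\n",
          lengths[3], tokens[3], lengths[2], tokens[2], j, k,
          res[i].combined_counter_vec[k][j].bytes);
}
```
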
### Results

{{< image width="400px" float="right" src="/assets/vpp-stats/grafana1.png" alt="Grafana 1" >}}

With this VPP Prometheus Exporter, I can now hook the VPP routers up to Prometheus and Grafana.
Aggregations in Grafana are super easy and scalable, due to the conversion of the static paths into
dynamically created labels on the Prometheus metric names.

Drawing a graph of the running time spent by each individual VPP graph node might look something
like this:

```
sum without (thread, index)(rate(nodes_clocks[60s]))
  /
sum without (thread, index)(rate(nodes_vectors[60s]))
```

The plot to the right shows a system under a loadtest that ramps up from 0% to 100% of line rate,
and the traces are the cumulative time spent in each node (on a logarithmic scale). The top purple
line represents `dpdk-input`. When a VPP dataplane is idle, the worker threads will be repeatedly
polling DPDK to ask it if it has something to do, spending 100% of their time being told "there is
nothing for you to do". But once load starts appearing, the other nodes start spending CPU time,
for example the chain of IPv4 forwarding is `ethernet-input`, `ip4-input`, `ip4-lookup`, followed by
`ip4-rewrite`, and ultimately the packet is transmitted on some other interface. When the system is
lightly loaded, the `ethernet-input` node for example will spend 1100 or so CPU cycles per packet,
but when the machine is under higher load, the time spent will decrease to as low as 22 CPU cycles
per packet. This is true for almost all of the nodes - VPP gets relatively _more efficient_ under
load.

{{< image width="400px" float="right" src="/assets/vpp-stats/grafana2.png" alt="Grafana 2" >}}

Another cool graph, one that I won't be able to see when using only LibreNMS and SNMP polling, is how
busy the router is. In VPP, each dispatch of the worker loop will poll DPDK and dispatch the packets
through the directed graph of nodes that I showed above. But how many packets can be moved through
the graph per CPU? The largest number of packets that VPP will ever offer into a call of the nodes
is 256. Typically an unloaded machine will have an average number of Vectors/Call of around 1.00.
When the worker thread is loaded, it may sit at around 130-150 Vectors/Call. If it's saturated, it
will quickly shoot up to 256.

As a good approximation, Vectors/Call normalized to 100% will be an indication of how busy the
dataplane is. In the picture above, between 10:30 and 11:00 my test router was pushing about 180Gbps
of traffic, but with large packets, so its total vectors/call was modest (roughly 35-40), which you
can see as all threads there running in the ~25% load range. Then at 11:00 a few threads got
hotter, and one of them completely saturated, and the traffic being forwarded by that CPU thread was
suffering _packetlo_, even though the others were absolutely fine... forwarding 150Mpps on a 10-year-old
Dell R720!

### What's Next

Together with the graph above, I can also see how many CPU cycles are spent in which
type of operation. For example, encapsulation of GENEVE or VxLAN is not _free_, although it's also
not very expensive. If I know how many CPU cycles are available (roughly the clock speed of the CPU
threads, in our case Xeon X1518 (2.2GHz) or Xeon E5-2683 v4 (3GHz) CPUs), I can pretty accurately
calculate what a given mix of traffic and features is going to cost, and how many packets/sec our
routers at IPng will be able to forward. Spoiler alert: it's way more than currently needed. Our
Supermicros can handle roughly 35Mpps each, and considering a regular mixture of internet traffic
(called _imix_) is about 3Mpps per 10G, I will have room to spare for the time being.

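As a back-of-the-envelope sanity check on those numbers (with illustrative figures, not a measured
full-feature path):

```
#include <stdio.h>

/* Illustrative budget: a worker core has clock_hz cycles per second to
 * spend; dividing by the cycles each packet costs across the node path
 * gives an upper bound on packets/sec for that core. */
int main (void) {
  double clock_hz = 2.2e9;            /* e.g. one 2.2GHz worker thread     */
  double cycles_per_packet = 140.0;   /* ip4-lookup alone, from the stats  */

  printf ("single node budget : %.1f Mpps\n", clock_hz / cycles_per_packet / 1e6);

  /* a realistic IPv4 path is several nodes (ethernet-input, ip4-input,
   * ip4-lookup, ip4-rewrite, tx), so the real per-core number is the
   * clock rate divided by the *sum* of clocks/vector over that path */
  double path_cycles = 700.0;         /* hypothetical sum over the path    */
  printf ("full path budget   : %.1f Mpps\n", clock_hz / path_cycles / 1e6);
  return 0;
}
```
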
This is super valuable information for folks running VPP in production.
I haven't put the finishing touches on the VPP Prometheus Exporter yet: for example, there are no
commandline flags, and it doesn't listen on any port other than 9482 (the same one that the toy
exporter in `src/vpp/app/vpp_prometheus_export.c` ships with
[[ref](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)]). My Grafana
dashboard is also not fully completed yet. I hope to get that done in April, and publish both the
exporter and the dashboard on GitHub. Stay tuned!