Rewrite all images to Hugo format

This commit is contained in:
2024-08-05 01:11:52 +02:00
parent 4230fd9acc
commit c1f1775c91
67 changed files with 29916 additions and 23 deletions

---
date: "2021-02-26T13:07:54Z"
title: IPng History
---
Historical context - todo, but notes for now
1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997.
2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN.
3. attracted the attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48
> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :)
5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man)
6. Eventually needed a provider-independent NOC and servers to operate it, which is where our PI space came from (and is still in use)
7. High Availability with paphosting of sixxs.net and other sites
8. Moved to IP-Max in 2014 (and still best of friends with that crew!)
9. In 2019, Fred said "hey why don't you get an AS number and announce your /24 PI yourself, that'll be fun!"
> I didn't want to at first, because "it is a lot of work to do it properly".
10. In 2020, I got to know Openfactory, a local ISP (with an office in the town where I live) that offers services on the local FTTH network; so I got a gigabit connection with them
> And that's when I made the plunge, got AS50869, started announcing my own PI space, built up a few routers, and the rest is ... history :)

---
date: "2021-02-27T21:31:00Z"
title: Loadtesting at Coloclue
---
* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewers: Coloclue Network Committee <[routers@coloclue.net](mailto:routers@coloclue.net)>
* Status: Draft - Review - **Published**
## Introduction
Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) had deteriorated, and the machines were up for renewal. At the same time, network latency/jitter has been very high, and the variability may be caused by the Linux router hardware, the software they run, the inter-datacenter links, or any combination of these. One specific example of why this is important: Coloclue runs BFD on their inter-datacenter links, which are VLANs provided to us by Atom86 and Fusix Networks. On these links Coloclue regularly sees ping times of 300-400ms, with huge outliers in the 1000ms range, which triggers BFD timeouts, causing iBGP reconvergence events and overall horrible performance. Before we open up a discussion with these (excellent!) L2 providers, we should first establish whether it is not more likely that Coloclue's own router hardware and/or software should be improved instead.
By means of example, let's take a look at Smokeping graphs that show these latency spikes, jitter and loss quite well. The first graph is taken from a machine at True (in EUNetworks) to a machine at Coloclue (in NorthC). The second graph, from the same machine at True to a machine at BIT (in Ede), does not exhibit this behavior.
| | |
| ---- | ---- |
| {{< image src="/assets/coloclue-loadtest/image0.png" alt="BIT" width="450px" >}} | {{< image src="/assets/coloclue-loadtest/image1.png" alt="Coloclue" width="450px" >}} |
*Images: Smokeping graph from True to Coloclue (left), and True to BIT (right). There is quite a difference.*
## Summary
I performed three separate loadtests. First, I did a loopback loadtest on the T-Rex machine, proving that it can send 1.488Mpps in both directions simultaneously. Then, I did a loadtest of the Atom86 link by sending the traffic through the Arista in NorthC, over the Atom86 link, to the Arista in EUNetworks, looping two ethernet ports, and sending the traffic back to NorthC. Due to VLAN tagging, this yielded 1.42Mpps throughput, exactly as predicted. Finally, I performed a stateful loadtest that saturated the Atom86 link, while injecting SCTP packets at 1KHz, measuring the latency observed over the Atom86 link.
**All three tests passed.**
## Loadtest Setup
After deploying the new NorthC routers (Supermicro Super Server/X11SCW-F with Intel Xeon E-2286G processors), I decided to rule out hardware issues, leaving link and software issues. To get a bit more insight on software or inter-datacenter links, I created the following two loadtest setups.
### 1. Baseline
Machine `dcg-2`, carrying an Intel 82576 quad Gigabit NIC, looped from the first two ports (port0 to port1). The point of this loopback test is to ensure that the machine itself is capable of sending and receiving the correct traffic patterns. Usually, one does an “imix” and a “64b” loadtest for this, and it is expected that the loadtester itself passes all traffic out on one port back into the other port, without any loss. The thing I am testing is called the DUT or *Device Under Test* and in this case, it is a UTP cable from NIC-NIC.
The expected packet rate: a minimum-size ethernet frame occupies 672 bits on the wire, so 10^9 / 672 == **1488095 packets per second** in each direction, traversing the link once. You will often see 1.488Mpps quoted as “the theoretical maximum” for gigabit, and this is why.
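The 672-bit figure can be double-checked in a few lines: a minimum ethernet frame is 64 bytes, and on the wire it is preceded by an 8-byte preamble/SFD and followed by a 12-byte inter-frame gap. A quick sketch:

```python
# Minimum-size Ethernet frame as it appears on the wire:
# 64 B frame + 8 B preamble/SFD + 12 B inter-frame gap = 84 B = 672 bits
FRAME_BITS = (64 + 8 + 12) * 8
LINE_RATE = 10**9  # 1 Gbit/s

pps = LINE_RATE // FRAME_BITS
print(FRAME_BITS, pps)  # 672 1488095
```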
### 2. Atom86
In this test, Tijn from Coloclue plugged `dcg-2` port0 into the core switch (an Arista) port e17, and configured that switchport as an access port for VLAN A, which is put on the Atom86 trunk to EUNetworks. Port1 is plugged into core switch port e18 and assigned a different VLAN B, which is also put on the Atom86 link to EUNetworks.
At EUNetworks, he then exposed that same VLAN A on port e17 and VLAN B on port e18, and used a DAC cable to connect e17 <-> e18. Thus, the path the packets travel now becomes the *Device Under Test* (DUT):
> port0 -> dcg-core-2:e17 -> Atom86 -> eunetworks-core-2:e17
>
> eunetworks-core-2:e18 -> Atom86 -> dcg-core-2:e18 -> port1
I should note that, because the loadtester emits traffic which is tagged by the `*-core-2` switches, the Atom86 link will see each tagged packet twice, and as we'll see, that VLAN tagging actually matters! The maximum expected packet rate: 672 bits for the ethernet frame + 32 bits for the VLAN tag == 704 bits per packet, sent in both directions but traversing the link twice. We can deduce that we should see 10^9 / 704 / 2 == **710227 packets per second** in each direction.
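The same arithmetic as before, now with the 4-byte (32-bit) 802.1Q tag added, and divided by two because every packet crosses the Atom86 link twice:

```python
# Each frame gains a 4-byte 802.1Q tag on the Atom86 trunk: 672 + 32 = 704 bits,
# and every packet traverses the shared link twice (once per loop direction).
TAGGED_BITS = 672 + 32
LINE_RATE = 10**9  # 1 Gbit/s

pps = LINE_RATE // TAGGED_BITS // 2
print(pps)  # 710227
```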
## Detailed Analysis
This section goes into details, but it is roughly broken down into:
1. Prepare machine (install T-Rex, needed kernel headers, and some packages)
1. Configure T-Rex (bind NIC from PCI bus into DPDK)
1. Run T-Rex interactively
1. Run T-Rex programmatically
### Step 1 - Prepare machine
Download T-Rex from the Cisco website and unpack it (I used version 2.88) in a directory that is readable by the `nobody` user. I used `/tmp/loadtest/` for this. Install some additional tools:
```
sudo apt install linux-headers-`uname -r` build-essential python3-distutils
```
### Step 2 - Bind NICs to DPDK
First I had to find which NICs can be used; these NICs have to be supported by DPDK, but luckily most Intel NICs are. I had a few ethernet NICs to choose from:
```
root@dcg-2:/tmp/loadtest/v2.88# lspci | grep -i Ether
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
0c:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0d:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
```
But the ones that have no link are a good starting point:
```
root@dcg-2:~/loadtest/v2.88# ip link | grep -v UP | grep enp7
6: enp7s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
7: enp7s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
```
This is PCI bus 7, slot 0, function 0 and 1, so the configuration file for T-Rex becomes:
```
root@dcg-2:/tmp/loadtest/v2.88# cat /etc/trex_cfg.yaml
- version : 2
interfaces : ["07:00.0","07:00.1"]
port_limit : 2
port_info :
- dest_mac : [0x0,0x0,0x0,0x1,0x0,0x00] # port 0
src_mac : [0x0,0x0,0x0,0x2,0x0,0x00]
- dest_mac : [0x0,0x0,0x0,0x2,0x0,0x00] # port 1
src_mac : [0x0,0x0,0x0,0x1,0x0,0x00]
```
### Step 3 - Run T-Rex Interactively
Start the loadtester. This is easiest with two terminals: one to run T-Rex itself and one to run the console:
```
root@dcg-2:/tmp/loadtest/v2.88# ./t-rex-64 -i
root@dcg-2:/tmp/loadtest/v2.88# ./trex-console
```
The loadtester starts with -i (interactive) and optionally -c (number of cores to use; in this case only 1 CPU core is used). I will be doing a loadtest at gigabit speeds only, so no significant CPU is needed. I will demonstrate below that one CPU core of this machine can generate (sink and source) approximately 72Gbit/s of traffic. The loadtester starts a control port on :4501 which the client connects to. You can now program the loadtester, either programmatically via an API or via the provided commandline/CLI tool; I'll demonstrate both.
In trex-console, I first enter TUI mode -- this stands for the Traffic UI. Here, I can load a profile into the loadtester, and while you can write your own profiles, there are many standard ones to choose from. There are two types of loadtest: stateful and stateless. I started with a simpler stateless one first; take a look at `stl/imix.py`, which is self-explanatory, but in particular, the mix consists of:
```
self.ip_range = {'src': {'start': "16.0.0.1", 'end': "16.0.0.254"},
'dst': {'start': "48.0.0.1", 'end': "48.0.0.254"}}
# default IMIX properties
self.imix_table = [ {'size': 60, 'pps': 28, 'isg':0 },
{'size': 590, 'pps': 16, 'isg':0.1 },
{'size': 1514, 'pps': 4, 'isg':0.2 } ]
```
Above one can see that there will be traffic flowing from 16.0.0.1-254 to 48.0.0.1-254, and there will be three streams generated at a certain ratio, 28 small 60 byte packets, 16 medium sized 590b packets, and 4 large 1514b packets. This is typically what a residential user would see (a SIP telephone call, perhaps a Jitsi video stream; some download of data with large MTU-filling packets; and some DNS requests and other smaller stuff). Executing this profile can be done with:
```
tui> start -f stl/imix.py -m 1kpps
```
.. which will start a 1kpps load of that packet stream. The traffic load can be changed by either specifying an absolute packet rate, or a percentage of line rate, and you can pause and resume as well:
```
tui> update -m 10kpps
tui> update -m 10%
tui> update -m 50%
tui> pause
# do something, there will be no traffic
tui> resume
tui> update -m 100%
```
After this last command, T-Rex will be emitting line rate packets out of port0 and out of port1, and it will be expecting to see the packets that it sent back on port1 and port0 respectively. If the machine is powerful enough, it can saturate traffic up to the line rate in both directions. One can see if things are successfully passing through the device under test (in this case, for now simply a UTP cable from port0-port1). The ibytes should match the obytes, and of course ipackets should match the opackets in both directions. Typically, a loss rate of 0.01% is considered acceptable. And, typically, a loss rate of a few packets in the beginning of the loadtest is also acceptable (more on that later).
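As an aside, the imix table loaded earlier implies an average packet size of roughly 358 bytes, which is why an imix profile saturates gigabit at a far lower packet rate than a 64b test. A quick sketch of that arithmetic (the 20 bytes of preamble and inter-frame gap per frame are my addition, to estimate the wire rate):

```python
# The three imix streams from stl/imix.py: (frame size in bytes, relative pps weight)
imix_table = [(60, 28), (590, 16), (1514, 4)]

total_weight = sum(w for _, w in imix_table)
avg_size = sum(size * w for size, w in imix_table) / total_weight
print(round(avg_size, 1))  # 357.8 bytes on average

# On the wire each frame also carries 8 B preamble + 12 B inter-frame gap,
# so at gigabit this mix tops out at roughly:
pps = 10**9 / ((avg_size + 20) * 8)
print(round(pps / 1e3))  # 331 (Kpps)
```

This lands close to the ~321Kpps the imix loadtest reports near line rate further down.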
Screenshot of port0-port1 loopback test with L2:
```
Global Statistitcs
connection : localhost, Port 4501 total_tx_L2 : 1.51 Gbps
version : STL @ v2.88 total_tx_L1 : 1.98 Gbps
cpu_util. : 5.63% @ 1 cores (1 per dual port) total_rx : 1.51 Gbps
rx_cpu_util. : 0.0% / 0 pps total_pps : 2.95 Mpps
async_util. : 0% / 104.03 bps drop_rate : 0 bps
total_cps. : 0 cps queue_full : 0 pkts
Port Statistics
port | 0 | 1 | total
-----------+-------------------+-------------------+------------------
owner | root | root |
link | UP | UP |
state | TRANSMITTING | TRANSMITTING |
speed | 1 Gb/s | 1 Gb/s |
CPU util. | 5.63% | 5.63% |
-- | | |
Tx bps L2 | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps
Tx bps L1 | 991.21 Mbps | 991.21 Mbps | 1.98 Gbps
Tx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps
Line Util. | 99.12 % | 99.12 % |
--- | | |
Rx bps | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps
Rx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps
---- | | |
opackets | 355108111 | 355111209 | 710219320
ipackets | 355111078 | 355108226 | 710219304
obytes | 22761267356 | 22761466414 | 45522733770
ibytes | 22761457966 | 22761274908 | 45522732874
tx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts
rx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts
tx-bytes | 22.76 GB | 22.76 GB | 45.52 GB
rx-bytes | 22.76 GB | 22.76 GB | 45.52 GB
----- | | |
oerrors | 0 | 0 | 0
ierrors | 0 | 0 | 0
```
Instead of `stl/imix.py` as a profile, one can also consider `stl/udp_1pkt_simple_bdir.py`. This will send UDP packets with 0 bytes of payload from a single host 16.0.0.1 to a single host 48.0.0.1 and back. Running the 1pkt UDP profile in both directions at gigabit link speeds allows for 1.488Mpps in each direction (a minimum ethernet frame carrying an IPv4 packet occupies 672 bits on the wire -- see [wikipedia](https://en.wikipedia.org/wiki/Ethernet_Frame) for details).
Above, one can see the system is in a healthy state - it has saturated the network bandwidth in both directions (991Mbps L1 rate, so this is the full 672 bits per ethernet frame, including the header, interpacket gap, etc), at 1.48Mpps. All packets sent by port0 (the opackets, obytes) should have been received by port1 (the ipackets, ibytes), and they are.
One can also learn that T-Rex is utilizing approximately 5.6% of one CPU core sourcing and sinking this load on the two gigabit ports (that's 2 gigabit out, 2 gigabit in), so for a DPDK application, **one CPU core is capable of 71Gbps and 53Mpps**, an interesting observation.
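That extrapolation is a back-of-the-envelope one; sketched below using the rounded numbers from the screen capture, which is why it lands just under the 71Gbps/53Mpps quoted:

```python
# One core at 5.63% CPU sources ~1.98 Gbps (L1) per direction and sinks the
# same again, while generating 2.95 Mpps in total across both ports.
cpu_frac = 0.0563
gbps_handled = 1.98 * 2      # tx + rx, Gbit/s at L1
mpps_generated = 2.95        # total generated packet rate, Mpps

print(round(gbps_handled / cpu_frac))    # 70
print(round(mpps_generated / cpu_frac))  # 52
```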
### Step 4 - Run T-Rex programmatically
I previously wrote a tool that runs a specific ramp-up profile, from a 1kpps warmup through to line rate, in order to find the maximum allowable throughput before a DUT exhibits too much loss. Usage:
```
usage: trex-loadtest.py [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE]
[-wm WARMUP_MULT] [-wd WARMUP_DURATION]
[-rt RAMPUP_TARGET] [-rd RAMPUP_DURATION]
[-hd HOLD_DURATION]
T-Rex Stateless Loadtester -- pim@ipng.nl
optional arguments:
-h, --help show this help message and exit
-s SERVER, --server SERVER
Remote trex address (default: 127.0.0.1)
-p PROFILE_FILE, --profile PROFILE_FILE
STL profile file to replay (default: imix.py)
-o OUTPUT_FILE, --output OUTPUT_FILE
File to write results into, use "-" for stdout
(default: -)
-wm WARMUP_MULT, --warmup_mult WARMUP_MULT
During warmup, send this "mult" (default: 1kpps)
-wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION
Duration of warmup, in seconds (default: 30)
-rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET
Target percentage of line rate to ramp up to (default:
100)
-rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION
Time to take to ramp up to target percentage of line
rate, in seconds (default: 600)
-hd HOLD_DURATION, --hold_duration HOLD_DURATION
Time to hold the loadtest at target percentage, in
seconds (default: 30)
```
Here, the loadtester will load a profile (imix.py for example), warm up for 30s at 1kpps, then ramp up linearly to 100% of line rate in 600s, and hold at line rate for 30s. The loadtest passes if during this entire time, the DUT had less than 0.01% packet loss. I must note that in this loadtest, I cannot ramp up to line rate (because the Atom86 link is used twice!), and I'll also note I cannot even ramp up to 50% of line rate (because the loadtester is sending untagged traffic, but the Arista is adding tags onto the Atom86 link!), so I expect to see **711Kpps**, which is just about **47% of line rate**.
The loadtester will emit a JSON file with all of its runtime stats, which can later be analyzed and used to plot graphs. First, let's look at an imix loadtest:
```
root@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/imix.json -p imix.py -rt 50
Running against 127.0.0.1 profile imix.py, warmup 1kpps for 30s, rampup target 50%
of linerate in 600s, hold for 30s output goes to /root/imix.json
Mapped ports to sides [0] <--> [1]
Warming up [0] <--> [1] at rate of 1kpps for 30 seconds
Setting load [0] <--> [1] to 1% of linerate
stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate)
stats: 321.30 Kpps 988.14 Mbps (98.81% of linerate)
Loadtest finished, stopping
Test has passed :-)
Writing output to /root/imix.json
```
And then let's step up our game with a 64b loadtest:
```
root@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/64b.json -p udp_1pkt_simple_bdir.py -rt 50
Running against 127.0.0.1 profile udp_1pkt_simple_bdir.py, warmup 1kpps for 30s, rampup target 50%
of linerate in 600s, hold for 30s output goes to /root/64b.json
Mapped ports to sides [0] <--> [1]
Warming up [0] <--> [1] at rate of 1kpps for 30 seconds
Setting load [0] <--> [1] to 1% of linerate
stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate)
stats: 1.42 Mpps 956.41 Mbps (95.64% of linerate)
stats: 1.42 Mpps 952.44 Mbps (95.24% of linerate)
WARNING: DUT packetloss too high
stats: 1.42 Mpps 955.19 Mbps (95.52% of linerate)
```
As an interesting note, this value of 1.42Mpps is exactly what I calculated and expected (see above for a full explanation). The math works out to 10^9 / 704 bits/packet == **1.42Mpps**, just short of the 1.488Mpps line rate I would have seen had Coloclue not used VLAN tags.
### Step 5 - Run T-Rex ASTF, measure latency/jitter
In this mode, T-Rex simulates many stateful flows using a profile, which replays actual PCAP data (as can be obtained with tcpdump), by spacing out the requests and rewriting the source/destination addresses, thereby simulating hundreds or even millions of active sessions - I used `astf/http_simple.py` as a canonical example. In parallel to the test, I let T-Rex run a latency check, by sending SCTP packets at a rate of 1KHz from each interface. By doing this, latency profile and jitter can be accurately measured under partial or full line load.
#### Bandwidth/Packet rate
Let's first take a look at the bandwidth and packet rates:
```
Global Statistitcs
connection : localhost, Port 4501 total_tx_L2 : 955.96 Mbps
version : ASTF @ v2.88 total_tx_L1 : 972.91 Mbps
cpu_util. : 5.64% @ 1 cores (1 per dual port) total_rx : 955.93 Mbps
rx_cpu_util. : 0.06% / 2 Kpps total_pps : 105.92 Kpps
async_util. : 0% / 63.14 bps drop_rate : 0 bps
total_cps. : 3.46 Kcps queue_full : 143,837 pkts
Port Statistics
port | 0 | 1 | total
-----------+-------------------+-------------------+------------------
owner | root | root |
link | UP | UP |
state | TRANSMITTING | TRANSMITTING |
speed | 1 Gb/s | 1 Gb/s |
CPU util. | 5.64% | 5.64% |
-- | | |
Tx bps L2 | 17.35 Mbps | 938.61 Mbps | 955.96 Mbps
Tx bps L1 | 20.28 Mbps | 952.62 Mbps | 972.91 Mbps
Tx pps | 18.32 Kpps | 87.6 Kpps | 105.92 Kpps
Line Util. | 2.03 % | 95.26 % |
--- | | |
Rx bps | 938.58 Mbps | 17.35 Mbps | 955.93 Mbps
Rx pps | 87.59 Kpps | 18.32 Kpps | 105.91 Kpps
---- | | |
opackets | 8276689 | 39485094 | 47761783
ipackets | 39484516 | 8275603 | 47760119
obytes | 978676133 | 52863444478 | 53842120611
ibytes | 52862853894 | 978555807 | 53841409701
tx-pkts | 8.28 Mpkts | 39.49 Mpkts | 47.76 Mpkts
rx-pkts | 39.48 Mpkts | 8.28 Mpkts | 47.76 Mpkts
tx-bytes | 978.68 MB | 52.86 GB | 53.84 GB
rx-bytes | 52.86 GB | 978.56 MB | 53.84 GB
----- | | |
oerrors | 0 | 0 | 0
ierrors | 0 | 0 | 0
```
In the above screen capture, one can see the traffic out of port0 is 20Mbps at 18.3Kpps, while the traffic out of port1 is 952Mbps at 87.6Kpps - this is because the clients are sourcing from port0, while the servers are simulated behind port1. Note the asymmetric traffic flow: T-Rex is using 972Mbps of total bandwidth over this 1Gbps VLAN, and a tiny bit more than that on the Atom86 link, because the Aristas insert VLAN tags in transit -- to be exact, 18.32+87.6 = 105.92Kpps worth of 4-byte tags, thus 3.389Mbit of extra traffic.
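That tag overhead can be verified in a couple of lines (a sketch; the Kpps figures are read off the port statistics above):

```python
# 4-byte 802.1Q tag added by the Aristas to every packet crossing the trunk
total_pps = (18.32 + 87.6) * 1e3  # Kpps from the port statistics
tag_bits = 4 * 8

overhead_mbps = total_pps * tag_bits / 1e6
print(round(overhead_mbps, 3))  # 3.389
```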
#### Latency Injection
Now, let's look at the latency in both directions, depicted in microseconds, at a throughput of 106Kpps (975Mbps):
```
Global Statistitcs
connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps
version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps
cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps
rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps
async_util. : 0% / 63.14 bps drop_rate : 0 bps
total_cps. : 3.47 Kcps queue_full : 143,837 pkts
Latency Statistics
Port ID: | 0 | 1
-------------+-----------------+----------------
TX pkts | 244068 | 242961
RX pkts | 242954 | 244063
Max latency | 23983 | 23872
Avg latency | 815 | 702
-- Window -- | |
Last max | 966 | 867
Last-1 | 948 | 923
Last-2 | 945 | 856
Last-3 | 974 | 880
Last-4 | 963 | 851
Last-5 | 985 | 862
Last-6 | 986 | 870
Last-7 | 946 | 869
Last-8 | 976 | 879
Last-9 | 964 | 867
Last-10 | 964 | 837
Last-11 | 970 | 867
Last-12 | 1019 | 897
Last-13 | 1009 | 908
Last-14 | 1006 | 897
Last-15 | 1022 | 903
Last-16 | 1015 | 890
--- | |
Jitter | 42 | 45
---- | |
Errors | 3 | 2
```
In the capture above, one can see the total number of latency measurement packets sent, and the latency measurements in microseconds. From port0->port1 the measured average latency was 0.815ms, while the latency in the other direction was 0.702ms. The discrepancy can be explained by the HTTP traffic being asymmetric (clients on port0 have to send their SCTP probes into a much busier port1), which creates queuing latency on the wire and NIC. The `Last-*` lines under it are the values of the last 16 seconds of measurements. The maximum observed latency was 23.9ms in one direction and 23.8ms in the other. I therefore have to conclude that the Atom86 link, even under stringent load, did not suffer from the large (hundreds-of-milliseconds) outliers during the entire 300s duration of my loadtest.
Jitter is defined as a variation in the delay of received packets. At the sending side, packets are sent in a continuous stream with the packets spaced evenly apart. Due to network congestion, improper queuing, or configuration errors, this steady stream can become lumpy, or the delay between each packet can vary instead of remaining constant. There was **virtually no jitter**: 42 microseconds in one direction, 45us in the other.
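T-Rex's exact jitter formula is not shown in the capture; a common way to estimate jitter from a stream of one-way delays is the RFC 3550 smoothed interarrival jitter. A small sketch, with made-up delay samples rather than data from this loadtest:

```python
def rfc3550_jitter(delays_us):
    """RFC 3550 interarrival jitter estimator: J += (|D(i-1,i)| - J) / 16."""
    j = 0.0
    for prev, cur in zip(delays_us, delays_us[1:]):
        j += (abs(cur - prev) - j) / 16.0
    return j

# A perfectly even stream has zero jitter...
print(rfc3550_jitter([800] * 100))  # 0.0
# ...while a delay that wobbles by tens of microseconds yields jitter of the
# same order, comparable to the 42us/45us reported above.
print(round(rfc3550_jitter([800, 850, 790, 830, 805] * 40), 1))
```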
#### Latency Distribution
While performing this test at 106Kpps (975Mbps), it's also useful to look at the latency distribution as a histogram:
```
Global Statistitcs
connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps
version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps
cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps
rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps
async_util. : 0% / 63.14 bps drop_rate : 0 bps
total_cps. : 3.47 Kcps queue_full : 143,837 pkts
Latency Histogram
Port ID: | 0 | 1
-------------+-----------------+----------------
20000 | 2545 | 2495
10000 | 5889 | 6100
9000 | 456 | 421
8000 | 874 | 854
7000 | 692 | 757
6000 | 619 | 637
5000 | 985 | 994
4000 | 579 | 620
3000 | 547 | 546
2000 | 381 | 405
1000 | 798 | 697
900 | 27451 | 346
800 | 163717 | 22924
700 | 102194 | 154021
600 | 24623 | 171087
500 | 24882 | 40586
400 | 18329 |
300 | 26820 |
```
In the capture above, one can see the number of packets observed within certain latency ranges: from port0 to port1, 102K SCTP latency probes were in transit between 700-799us, and 163K probes between 800-899us. In the other direction, 171K probes were between 600-699us and 154K between 700-799us. This is corroborated by the mean latencies seen above (815us from port0->port1 and 702us from port1->port0).
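The histogram can also be reduced to one headline number, for instance the fraction of probes that completed within 1ms. A sketch, with the port0 column transcribed from the capture above (bucket keys are the lower bounds in microseconds):

```python
# Port 0 latency histogram: {bucket lower bound in us: probe count}
hist_port0 = {
    20000: 2545, 10000: 5889, 9000: 456, 8000: 874, 7000: 692, 6000: 619,
    5000: 985, 4000: 579, 3000: 547, 2000: 381, 1000: 798, 900: 27451,
    800: 163717, 700: 102194, 600: 24623, 500: 24882, 400: 18329, 300: 26820,
}

total = sum(hist_port0.values())
under_1ms = sum(n for bucket, n in hist_port0.items() if bucket < 1000)
print(round(100 * under_1ms / total, 1))  # 96.4 (% of probes under 1 ms)
```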

---
date: "2021-02-27T23:46:12Z"
title: IPng Network
---
# Introduction to IPng Networks
At IPng Networks, we run a modest network with European reach. With our home base
in Zurich, Switzerland, we are pretty well connected into the Swiss internet scene.
We operate four sites in Zurich, and an additional set of sites in European cities,
each of which is described in this post. If you're curious as to how the network
runs, you can find two main pieces here: firstly, the physical parts - where exactly
IPng's routers and switches are, what type of kit the ISP uses, and so on;
secondly, the logical parts - what operating systems and configurations are in use.
## Physical
### Zurich Metropolitan Area
The Canton of Zurich, Switzerland is our home-base, and it's where IPng
Networks GmbH is registered. The local commercial datacenter scene is dominated
by Interxion, NTT and Equinix. The small town of Br&uuml;ttisellen (zipcode
CH-8306), is where our founder lives and, due to the ongoing Corona pandemic,
where he works from home.
{{< image width="400px" float="left" src="/assets/network/zurich-ring.png" alt="Zurich Metro" >}}
In Br&uuml;ttisellen, marked with **C**, we have our first two routers,
`chbtl0.ipng.ch` and `chbtl1.ipng.ch`, racked in our office. There are only two
fiber operators in this town - UPC and Swisscom. The orange trace (**C** to **D**)
is a leased line from UPC, which we rent from [Openfactory](https://openfactory.net/)
and it gets terminated at Interxion Glattbrugg, where our router
`chgtg0.ipng.ch` is located. From there, Openfactory rents darkfiber
to multiple locations - notably the dark purple trace (**D** to **E**)
that connects Interxion Glattbrugg to NTT R&uuml;mlang, where our router
`chrma0.ipng.ch` is located.
We rent a 10G CWDM wave between these two datacenters, directly connecting these
two routers. Now, Equinix also has a sizable footprint in Z&uuml;rich,
operating ZH04 (**B**, where we only have a passive optical presence) in the
Industriekwartier (our local internet exchange [SwissIX](https://swissix.net/)
was born in the now defunct Equinix ZH01 office building). From the neighboring
building Equinix ZH04, our partner [IP-Max](https://ip-max.net/) rents dark fiber
to Equinix ZH05 in the Zurich Allmend area (the light purple trace **B** to **F**),
and from there, IP-Max rents dark fiber to NTT R&uuml;mlang again (**F** to **E**),
completing the ring. We rent a 10G circuit on that path, to redundantly connect our
routers `chgtg0` and `chrma0`. If at any time we'd need to connect partners
or customers, we can do so at a moment's notice, as rackspace is available in
all Equinix sites for IPng Networks.
The green link (**D** to **B**) is a 10G carrier ethernet circuit from Interxion,
continuing over the light purple path (**B** to **A**) on its last mile to Albisrieden, where
we built a very small colocation site, which you can read about in more detail in our
[informational post]({% post_url 2022-02-24-colo %}) - the colo is open for private
individuals and small businesses ([contact](/s/contact/) us for details!).
### European Ring
At IPng, we are strong believers in a free and open Internet. Having seen
the shakeout of internet backbone providers over the last two decades, it
seems to be a race to the bottom, with mergers, acquisitions and takeovers
of datacenters and network carriers. Prices keep going lower, and for small
fish (let's be honest, IPng Networks is definitely a small provider), it has
reached the point where purchasing IP transit is cheaper than connecting to
local Internet exchange points. We've decided specifically to go the extra
mile, quite literally, and plot a path to several continental european
internet hubs.
{{< image width="400px" float="left" src="/assets/network/european-ring.png" alt="European Ring" >}}
***Frankfurt*** - Connected from NTT's datacenter at R&uuml;mlang (Zurich) with
a first 10G circuit, and from Interxion's datacenter at Glattbrugg (Zurich)
with a second 10G circuit, this is our first hop into the world. Here, we
connect to [DE-CIX](https://de-cix.net/) from Equinix FR5 at the Kleyerstrasse.
More details in our post [IPng Arrives in Frankfurt]({% post_url 2021-05-17-frankfurt %}).
***Amsterdam*** - The Amsterdam Science Park is where European Internet was born.
[NIKHEF](https://www.nikhef.nl/) is where we rent rackspace that connects with a 10G
circuit to Frankfurt, and a 10G circuit onwards towards Lille. We connect to
[Speed-IX](https://speed-ix.net/), [LSIX](https://lsix.net/), [NL-IX](https://nl-ix.net),
and an exchange point we help run called [FrysIX](https://www.frys-ix.net/).
More details in our post [IPng Arrives in Amsterdam]({% post_url 2021-05-26-amsterdam %}).
***Lille*** - [IP-Max](https://ip-max.net/) does lots of business in this
region, with presence in both local datacenters here, one in Lille and one in
Anzin. IPng has a point of presence here too, at the [CIV1](https://www.civ.fr/)
facility, with a northbound 10G circuit to Amsterdam, and a southbound 10G
circuit to Paris. Here, we connect to [LillIX](https://lillix.fr/).
More details in our post [IPng Arrives in Lille]({% post_url 2021-05-28-lille %}).
***Paris*** - Where two large facilities are placed back-to-back in the middle
of the city, originally Telehouse TH2, with a new facility at L&eacute;on Frot,
where we pick up a 10G circuit from Lille and further on the ring with a 10G
circuit to Geneva. Here, we connect to [FranceIX](https://franceix.net).
More details in our post [IPng Arrives in Paris]({% post_url 2021-06-01-paris %}).
***Geneva*** - The home-base of [IP-Max](https://ip-max.net) is where we close
our ring. From Paris, IP-Max has two redundant paths back to Switzerland, the first
being a DWDM link to Zurich, and the second being a DWDM link to Lyon and
then into Geneva. Here, at [SafeHost](https://safehost.com/) in Plan les Ouates,
is where we have our fourth Swiss point of presence, with a connection to our very
own [Free-IX](https://free-ix.net/) and a 10G circuit to Interxion at Glattbrugg
(Zurich), and of course to Paris.
More details in our post [IPng Arrives in Geneva]({% post_url 2021-07-03-geneva %}).
## Logical
As a small operator, we'd love to be able to boast the newest Juniper [PTX10016](https://www.juniper.net/us/en/products/routers/ptx-series.html)
routers, but we have neither the rack space, nor the power budget, nor, to be
perfectly honest, the monetary budget to run these at IPng Networks. But it
turns out we know a fair bit about hardware, silicon, architecture and the
controlplane software running on commercial routers.
We've decided to go a different route. In our opinion, at speeds under 100Gbit,
it's perfectly viable to use software routers on off-the-shelf hardware, notably
Intel network cards and CPUs that have support for the
[Dataplane Development Kit](https://dpdk.org/) (aka DPDK), which offers libraries
to accelerate packet processing workloads, turning ordinary servers into very
performant routers. Two notable applications are [VPP](https://fd.io/) and
[Danos](https://danosproject.org).
### VPP
VPP originally comes from the house of Cisco [[ref](https://www.cisco.com/c/dam/m/en_us/service-provider/ciscoknowledgenetwork/files/592_05_25-16-fdio_is_the_future_of_software_dataplanes-v2.pdf)] and looks quite a bit like
the commercial ASR9k platform. In development since 2002, VPP is production
code currently running in shipping products. It runs in user space on multiple
architectures including x86, ARM, and Power architectures on both x86 servers
and embedded devices. The design of VPP is hardware, kernel, and deployment
(bare metal, VM, container) agnostic. It runs completely in userspace.
We've contributed a little bit to the Control Plane abstraction [[ref](https://docs.fd.io/vpp/21.06/dc/d2e/clicmd_src_plugins_linux-cp.html)],
which allows users to combine the throughput of a dataplane with usual routing
software like [Bird](https://bird.network.cz/) or [FRR](https://frrouting.org/).
We've been running it in production since December 2020 on `chbtl1.ipng.ch`.
It's our ultimate goal to run VPP and Linux Control Plane on the entire network,
as the design and architecture really resonates with us as software and systems
engineers.
### DANOS
The Disaggregated Network Operating System (DANOS) project originally comes
from AT&Ts “dNOS” software framework and provides an open, cost-effective and
flexible alternative to traditional networking equipment. As part of The Linux
Foundation, it now incorporates contributions from complementary open source
communities in building a standardized distributed Network Operating System (NOS)
to speed the adoption and use of white boxes in a service provider's
infrastructure.
We've been using DANOS since its first release in August 2019, and it's
currently our routing platform of choice -- it combines the sheer speed of
DPDK with a [Vyatta](https://en.wikipedia.org/wiki/Vyatta) command line
interface. As an appliance, care was taken to complete the _whole package_,
with SNMP, YANG interface, image and upgrade management, interface monitoring
with wireshark semantics, et cetera. Performing easily at wire speed 10G
workloads (including 64byte ethernet frames), and being completely open source,
it fits very well with our philosophy of an open and free internet.
---
date: "2021-03-27T11:33:00Z"
title: 'Case Study: VPP at Coloclue, part 1'
---
* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewers: Coloclue Network Committee <[routers@coloclue.net](mailto:routers@coloclue.net)>
* Status: Draft - Review - **Published**
## Introduction
Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of the previous hardware platform (Dell R610) deteriorated, and the machines were up for renewal. At the same time, network latency/jitter had been very high, and the variability could be caused by the Linux router hardware, the routing software, the inter-datacenter links, or any combination of these. The routers were replaced with relatively modern hardware. In a [previous post]({% post_url 2021-02-27-coloclue-loadtest %}), I looked into the links between the datacenters, and demonstrated that they are performing as expected (1.41Mpps of 802.1q ethernet frames in both directions). That leaves the software. This post explores replacing the Linux kernel routers with a userspace process running VPP, an application built on DPDK.
### Executive Summary
I was unable to run VPP due to an issue detecting and making use of the Intel x710 network cards in this chassis. While the Intel i210-AT cards worked well, both with the standard `vfio-pci` driver and with an alternative `igb_uio` driver, I did not manage to get the Intel x710 cards to fully work (noting that I have the same Intel x710 NIC working flawlessly in VPP on another Supermicro chassis). See below for a detailed writeup of what I tried and which results were obtained. In the end, I reverted the machine back to its (mostly) original state, with three pertinent changes:
1. I left the Debian Backports kernel 5.10 running
1. I turned on IOMMU (Intel VT-d was already on), booting with `iommu=pt intel_iommu=on`
1. I left Hyperthreading off in the BIOS (it was on when I started)
After I restored the machine to its original Linux+Bird configuration, I noticed a marked improvement in latency, jitter and throughput. A combination of these changes is likely beneficial, so I do recommend making this change on all Coloclue routers, while we continue our quest for faster, more stable network performance.
So the bad news is: I did not get to prove that VPP and DPDK are awesome in AS8283. Yet.
But the good news is: **network performance improved** drastically. I'll take it :)
### Timeline
| | |
| ---- | ---- |
| {{< image src="/assets/coloclue-vpp/image0.png" alt="AS15703" width="450px" >}} | {{< image src="/assets/coloclue-vpp/image1.png" alt="AS12859" width="450px" >}} |
The graph on the left shows latency from AS15703 (True) in EUNetworks to a Coloclue machine hosted in NorthC. As far as Smokeping is concerned, latency has been quite poor for as long as it can remember (at least a year). The graph on the right shows the latency from AS12859 (BIT) to the beacon on `185.52.225.1/24` which is announced only on dcg-1, on the day this project was carried out.
Looking more closely at the second graph:
**Sunday 07:30**: The machine was put into maintenance, which made the latency jump. This is because the beacon was no longer reachable directly behind dcg-1 from AS12859 over NL-IX, but via an alternative path which traversed several more Coloclue routers, hence higher latency and jitter/loss.
**Sunday 11:00**: I rolled back the VPP environment on the machine, restoring it to its original configuration, except running kernel 5.10 and with Intel VT-d and Hyperthreading both turned off in the BIOS. A combination of those changes has definitely worked wonders. See also the `mtr` results down below.
**Sunday 14:50**: Because I didn't want to give up, and because I expected a little more collegiality from my friend dcg-1, I gave it another go by enabling IOMMU and PT, booting the 5.10 kernel with `iommu=pt` and `intel_iommu=on`. Now, with the `igb_uio` driver loaded, VPP detected both the i210 and x710 NICs, however it did not want to initialize the 4th port on the NIC (this was `enp1s0f3`, the port to Fusix Networks), and the port `eno1` only partially worked (IPv6 was fine, IPv4 was not). During this second attempt though, the rest of VPP and Bird came up, including NL-IX, the LACP, all internal interfaces, IPv4 and IPv6 OSPF and all BGP peering sessions with members.
**Sunday 16:20**: I could not in good faith turn on eBGP peers though, because of the interaction with `eno1` and `enp1s0f3` described in more detail below. I then ran out of time, and restored service with Linux 5.10 kernel and the original Bird configuration, now with Intel VT-d turned on and IOMMU/PT enabled in the kernel.
### Quick Overview
This paper, at a high level, discusses the following:
1. Gives a brief introduction of VPP and its new Linux CP work
1. Discusses a means to isolate a /24 on exactly one Coloclue router
1. Demonstrates changes made to run VPP, even though they were not applied
1. Compares latency/throughput before-and-after in a surprising improvement, unrelated to VPP
## 1. Introduction to VPP
VPP stands for _Vector Packet Processing_. In development since 2002, VPP is production code currently running in shipping products. It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic. It runs ***completely in userspace***. VPP helps push extreme limits of performance and scale. Independent testing shows that, at scale, VPP-powered routers are two orders of magnitude faster than currently available technologies.
The Linux (and BSD) kernel is not optimized for network I/O. Each packet (or in some implementations, a small batch of packets) generates an interrupt which causes the kernel to stop what it's doing, schedule the interrupt handler, do the necessary steps in the networking stack for each individual packet in turn: layer2 input, filtering, NAT session matching and packet rewriting, IP next-hop lookup, interface and L2 next-hop lookup, and marshalling the packet back onto the network, or handing it over to an application running on the local machine. And it does this for each packet one after another.
VPP takes away a few inefficiencies in this process in a few ways:
* VPP does not use interrupts, does not use the kernel network driver, and does not use the kernel networking stack at all. Instead, it attaches directly to the PCI device and polls the network card directly for incoming packets.
* Once network traffic gets busier, VPP constructs a _collection of packets_ called a _vector_, to pass through a directed graph of smaller functions. There's a clear performance benefit to such an architecture: the first packet from the vector will possibly hit a cold instruction/data cache in the CPU, but the second through Nth packets from the vector will execute on a hot cache and need little or no memory access, executing an order of magnitude faster, or even better.
* VPP is multithreaded and can have multiple cores polling and executing receive and transmit queues for network interfaces at the same time. Routing information (like next hops, forwarding tables, etc) should be carefully maintained, but in principle, VPP linearly scales with the amount of cores.
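The per-node batching described above can be illustrated with a toy sketch (plain Python, not VPP code): both traversal orders compute the same result, but the vector order runs each graph node over the whole batch while that node's instructions and data are hot in cache.

```python
# Toy model of scalar vs. vector packet processing (illustration only,
# not VPP code). The graph nodes are stand-ins for e.g. l2-input and
# ip4-lookup; both orders give the same result, but the vector order
# visits each node once per batch instead of once per packet.

packets = list(range(8))
graph = [lambda p: p + 1, lambda p: p * 2]  # two stand-in graph nodes

# Scalar: each packet traverses the full graph before the next starts.
scalar = [graph[1](graph[0](p)) for p in packets]

# Vector: each node processes the entire batch in turn.
vector = packets
for node in graph:
    vector = [node(p) for p in vector]

assert scalar == vector
```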
It is straightforward to obtain 10Mpps of forwarding throughput per CPU core, so a 32 core machine (handling 320Mpps) can realistically saturate 21x10Gbit interfaces (at 14.88Mpps each). A similar 32-core machine, if it has a sufficient number of PCI slots and network cards, can route an internet mixture of traffic at throughputs of roughly 492Gbit (320Mpps at 650Kpps per 10G of imix).
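These figures are easy to sanity-check. A minimal sketch, assuming the quoted 10Mpps per core and the standard 20 bytes of per-frame Ethernet overhead (preamble, start-of-frame delimiter, inter-frame gap):

```python
# Back-of-the-envelope check of the forwarding numbers quoted above.

def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Frames per second needed to saturate a link: each frame carries
    20 bytes of overhead (7B preamble + 1B SFD + 12B inter-frame gap)."""
    return link_bps / ((frame_bytes + 20) * 8)

pps_per_core = 10e6                # quoted per-core throughput
cores = 32
total_pps = pps_per_core * cores   # 320 Mpps

wire_rate_10g_64b = line_rate_pps(10e9, 64)     # ~14.88 Mpps
ports = int(total_pps // wire_rate_10g_64b)

print(f"{wire_rate_10g_64b / 1e6:.2f} Mpps per 10G port at 64B")  # 14.88
print(f"{ports} x 10G ports at wire speed")                       # 21
```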
VPP, upon startup, will disassociate the NICs from the kernel and bind them into the `vpp` process, which will promptly run at 100% CPU due to its DPDK polling. There's a tool, `vppctl`, which allows the operator to configure the VPP process: create interfaces, set attributes like link state, MTU, MPLS, bonding, IPv4/IPv6 addresses, and add/remove routes in the _forwarding information base_ (or FIB). VPP further works with plugins that add specific functionality; examples include LLDP, DHCP, IKEv2, NAT, DSLITE, Load Balancing, Firewall ACLs, GENEVE, VXLAN, VRRP, and Wireguard, to name but a few popular ones.
### Introduction to Linux CP Plugin
However, notably (or perhaps notoriously), VPP is only a dataplane application; it does not have any routing protocols like `OSPF` or `BGP`. A relatively new plugin is called the Linux Control Plane (or LCP), and it consists of two parts, one public and one under development at the time of this article. The first plugin allows the operator to create a Linux `tap` interface and pass through or _punt_ traffic from the dataplane into it. This way, the userspace VPP application creates a link back into the kernel, and an interface (eg. `vpp0`) appears. Input packets in VPP have all input features applied (firewall, NAT, session matching, etc), and if a packet is sent to an IP address with an LCP pair associated with it, it is punted to the `tap` device. So if, on the Linux side, the same IP address is put on the resulting `vpp0` device, Linux will see it. Responses from the kernel into the `tap` device are picked up by the Linux CP plugin and re-injected into the dataplane, and all output features of VPP are applied. This makes bidirectional traffic possible. You can read up on the Linux CP plugin in the [VPP documentation](https://docs.fd.io/vpp/21.06/d6/ddb/lcp_8api.html).
Here's a barebones example of plumbing the VPP interface `GigabitEthernet7/0/0` through a network device `vpp0` in the `dataplane` network namespace.
```
pim@vpp-west:~$ sudo systemctl restart vpp
pim@vpp-west:~$ vppctl lcp create GigabitEthernet7/0/0 host-if vpp0 namespace dataplane
pim@vpp-west:~$ sudo ip netns exec dataplane ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
12: vpp0: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:8a:0e:97 brd ff:ff:ff:ff:ff:ff
pim@vpp-west:~$ vppctl show interface
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet7/0/0 1 down 9000/0/0/0
local0 0 down 0/0/0/0
tap1 2 up 9000/0/0/0
```
### Introduction to Linux NL Plugin
You may be wondering, what happens with interface addresses or static routes? Usually, a userspace application like `ip link add` or `ip address add`, or a higher level process like `bird` or `FRR`, will want to set routes towards next hops on interfaces, using routing protocols like `OSPF` or `BGP`. The Linux kernel picks these events up and can share them as so-called `netlink` messages with interested parties. Enter the second plugin (the one under development at the moment), which is a netlink listener. Its job is to pick up netlink messages from the kernel and apply them to the VPP dataplane. With the Linux NL plugin enabled, events like adding or removing links, addresses or routes, or setting link state or MTU, will all be mirrored into the dataplane. I'm hoping the netlink code will be released in the upcoming VPP release, but contact me any time if you'd like to discuss details of the code, which can currently be found under community review in the [VPP Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).
Building on the example above, with this Linux NL plugin enabled, we can now manipulate VPP state from Linux, for example creating an interface and adding an IPv4 address to it (of course, IPv6 works just as well!):
```
pim@vpp-west:~$ sudo ip netns exec dataplane ip link set vpp0 up mtu 1500
pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 2001:db8::1/64 dev vpp0
pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 10.0.13.2/30 dev vpp0
pim@vpp-west:~$ sudo ip netns exec dataplane ping -c1 10.0.13.1
PING 10.0.13.1 (10.0.13.1) 56(84) bytes of data.
64 bytes from 10.0.13.1: icmp_seq=1 ttl=64 time=0.591 ms
--- 10.0.13.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.591/0.591/0.591/0.000 ms
pim@vpp-west:~$ vppctl show interface
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet7/0/0 1 up 1500/0/0/0 rx packets 4
rx bytes 268
tx packets 14
tx bytes 1140
drops 2
ip4 2
local0 0 down 0/0/0/0
tap1 2 up 9000/0/0/0 rx packets 10
rx bytes 796
tx packets 2
tx bytes 140
ip4 1
ip6 8
pim@vpp-west:~$ vppctl show interface address
GigabitEthernet7/0/0 (up):
L3 10.0.13.2/30
L3 2001:db8::1/64
local0 (dn):
tap1 (up):
```
As can be seen above, setting the link state up, setting the MTU, adding an address were all captured by the Linux NL plugin and applied in the dataplane. Further to this, the Linux NL plugin also synchronizes route updates into the _forwarding information base_ (or FIB) of the dataplane:
```
pim@vpp-west:~$ sudo ip netns exec dataplane ip route add 100.65.0.0/24 via 10.0.13.1
pim@vpp-west:~$ vppctl show ip fib 100.65.0.0
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
100.65.0.0/24 fib:0 index:15 locks:2
lcp-rt refs:1 src-flags:added,contributing,active,
path-list:[27] locks:2 flags:shared, uPRF-list:19 len:1 itfs:[1, ]
path:[34] pl-index:27 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
10.0.13.1 GigabitEthernet7/0/0
[@0]: ipv4 via 10.0.13.1 GigabitEthernet7/0/0: mtu:1500 next:5 flags:[] 52540015f82a5254008a0e970800
```
***Note***: I built the code for VPP v21.06 including the Linux CP and Linux NL plugins at tag `21.06-rc0~476-g41cf6e23d` on Debian Buster for the rest of this project, to match the operating system in use on Coloclue routers. I did this without additional modifications (even though I must admit, I do know of a few code paths in the netlink handler that still trigger a crash, and I have a few fixes in my client at home, so I'll be careful to avoid the pitfalls for now :-).
## 2. Isolating a Device Under Test
Coloclue has several routers, so to ensure that the traffic traverses only the one router under test, I decided to use an allocated but currently unused IPv4 prefix and announce it only from one of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2 scripts to generate a Bird1.6 configuration for each router. This is great, because it allows me to add a small feature to get what I need: **beacons**.
A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and uses a construction similar to the existing feature for "supernets". It determines if a given prefix must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will then be matched in `is_coloclue_beacon()` and announced.
Based on a per-router config (eg. `vars/dcg-1.router.nl.coloclue.net.yml`) I can now add the following YAML stanza:
```
coloclue:
beacons:
- prefix: "185.52.225.0"
length: 24
comment: "VPP test prefix (pim)"
```
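Under the hood, Kees renders such a stanza into a Bird filter function via its Jinja2 templates. As a minimal sketch (hypothetical and much simplified, not the actual Kees template code), the rendering amounts to:

```python
# Hypothetical, simplified stand-in for Kees' Jinja2 rendering of the
# beacons list into a Bird filter function. The function and helper
# names mirror the generated Bird output, not Kees internals.

beacons = [
    {"prefix": "185.52.225.0", "length": 24, "comment": "VPP test prefix (pim)"},
]

def render_beacon_function(beacons):
    lines = [
        "function is_coloclue_beacon()",
        "{",
        "  # Prefix must fall within one of our supernets.",
        "  if (!is_coloclue_more_specific()) then return false;",
    ]
    for b in beacons:
        lines.append(
            f"  if (net = {b['prefix']}/{b['length']}) then return true;"
            f" /* {b['comment']} */"
        )
    lines += ["  return false;", "}"]
    return "\n".join(lines)

print(render_beacon_function(beacons))
```

With an empty `beacons` list, the generated function simply returns false, which is what the safety check below should show.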
Because tinkering with routers in the _Default Free Zone_ is a great way to cause an outage, I needed to ensure that the code I wrote was well tested. I first ran `./update-routers.sh check` with no beacon config. This succeeded:
```
[...]
checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird.conf
checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird6.conf
checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird.conf
checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird6.conf
checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird.conf
checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird6.conf
checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird.conf
checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird6.conf
```
And I made sure that the generated function is indeed empty:
```
function is_coloclue_beacon()
{
# Prefix must fall within one of our supernets, otherwise it cannot be a beacon.
if (!is_coloclue_more_specific()) then return false;
return false;
}
```
Then, I ran the configuration again with one IPv4 beacon set on dcg-1. All the bird configs for both IPv4 and IPv6 on all routers still parsed correctly, and the generated function in the dcg-1 IPv4 filters file was populated:
```
function is_coloclue_beacon()
{
# Prefix must fall within one of our supernets, otherwise it cannot be a beacon.
if (!is_coloclue_more_specific()) then return false;
if (net = 185.52.225.0/24) then return true; /* VPP test prefix (pim) */
return false;
}
```
I then wired up the function into `function ebgp_peering_export()` and submitted the beacon configuration above, as well as a static route for that beacon prefix to a server running in the NorthC (previously called DCG) datacenter. You can read the details in this [Kees commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6). The dcg-1 router is connected to [NL-IX](https://nl-ix.net/), so it's expected that after this configuration went live, peers can now see that prefix only via NL-IX, and it's a more specific to the overlapping supernet (which is `185.52.224.0/22`).
And indeed, a traceroute now only traverses dcg-1 as seen from peer BIT (AS12859 coming from NL-IX):
```
1. lo0.leaf-sw4.bit-2b.network.bit.nl
2. lo0.leaf-sw6.bit-2a.network.bit.nl
3. xe-1-3-1.jun1.bit-2a.network.bit.nl
4. coloclue.the-datacenter-group.nl-ix.net
5. vpp-test.ams.ipng.ch
```
As well as return traffic from Coloclue to that peer:
```
1. bond0-100.dcg-1.router.nl.coloclue.net
2. bit.bit2.nl-ix.net
3. lo0.leaf-sw6.bit-2a.network.bit.nl
4. lo0.leaf-sw4.bit-2b.network.bit.nl
5. sandy.ipng.nl
```
## 3. Installing VPP
First, I need to ensure that the machine is reliably reachable via its IPMI interface (normally using serial-over-LAN, but with remote KVM available just in case). This is required because all the network interfaces above will be bound by VPP, and if the `vpp` process were ever to crash, it would be restarted without configuration. On a production router, one would expect a configuration daemon that can persist a configuration and recreate it in case of a server restart or dataplane crash.
Before we start, let's build VPP with our two beautiful plugins, copy them to dcg-1, and install all the supporting packages we'll need:
```
pim@vpp-builder:~/src/vpp$ make install-dep
pim@vpp-builder:~/src/vpp$ make build
pim@vpp-builder:~/src/vpp$ make build-release
pim@vpp-builder:~/src/vpp$ make pkg-deb
pim@vpp-builder:~/src/vpp$ dpkg -c build-root/vpp-plugin-core*.deb | egrep 'linux_(cp|nl)_plugin'
-rw-r--r-- root/root 92016 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_cp_plugin.so
-rw-r--r-- root/root 57208 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_nl_plugin.so
pim@vpp-builder:~/src/vpp$ scp build-root/*.deb root@dcg-1.router.nl.coloclue.net:/root/vpp/
pim@dcg-1:~$ sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 \
libnl-route-3-200 libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser
pim@dcg-1:~$ sudo dpkg -i /root/vpp/*.deb
pim@dcg-1:~$ sudo usermod -a -G vpp pim
```
On a BGP speaking router, `netlink` messages can come in rather quickly as peers come and go. Due to an unfortunate design choice in the Linux kernel, messages are not buffered for clients, which means that a buffer overrun can occur. To avoid this, I'll raise the netlink socket size to 64MB, leveraging a feature that creates a producer queue in the Linux NL plugin, so that VPP can drain the messages from the kernel into its own memory as quickly as possible. To be able to raise the netlink socket buffer size, we need to set some variables with `sysctl` (note as well the usual variables VPP wants to set with regards to [hugepages](https://wiki.debian.org/Hugepages) in `/etc/sysctl.d/80-vpp.conf`, which the Debian package installs for you):
```
pim@dcg-1:~$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf
# Increase netlink to 64M
net.core.rmem_default=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.wmem_max=67108864
EOF
pim@dcg-1:~$ sudo sysctl -p /etc/sysctl.d/81-vpp-netlink.conf /etc/sysctl.d/80-vpp.conf
```
### VPP Configuration
Now that I'm sure traffic to and from `185.52.225.0/24` will go over dcg-1, let's take a look at the machine itself. It has six network interfaces: two onboard Intel i210 gigabit NICs and one quad-port Intel x710-DA4 tengig NIC. To run VPP, the network cards in the machine need to be supported by [Intel's DPDK](https://www.dpdk.org/) libraries. The ones in this machine are all OK (but as we'll see later, problematic for unexplained reasons):
```
root@dcg-1:~# lspci | grep Ether
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
```
To handle the inbound traffic, netlink messages and other internal memory structures, I'll allocate 2GB of _hugepages_ to the VPP process. I'll then of course enable the two Linux CP plugins. Because VPP keeps a lot of statistics counters (for example, a few stats for each used prefix in its _forwarding information base_ or FIB), I will need to give it more than the default of 32MB of stats memory. I'd like to execute a few commands to further configure the VPP runtime upon startup, so I'll add a startup-config stanza. Finally, although on a production router I would, here I will not specify the DPDK interfaces explicitly, because I know that VPP will take over any supported network card that is in link-down state upon startup. As long as I boot the machine with unconfigured NICs, I will be good.
So, here's the configuration I end up adding to `/etc/vpp/startup.conf`:
```
unix {
startup-config /etc/vpp/vpp-exec.conf
}
memory {
main-heap-size 2G
main-heap-page-size default-hugepage
}
plugins {
path /usr/lib/x86_64-linux-gnu/vpp_plugins
plugin linux_cp_plugin.so { enable }
plugin linux_nl_plugin.so { enable }
}
statseg {
size 128M
}
# linux-cp {
# default netns dataplane
# }
```
***Note***: It is important to isolate the `tap` devices into their own Linux network namespace. If this is not done, packets arriving via the dataplane will not have a route up and into the kernel for interfaces VPP is not aware of, making those kernel-enabled interfaces unreachable. Due to the use of a network namespace, all applications in Linux will have to be run in that namespace (think: `bird`, `sshd`, `snmpd`, etc) and the firewall rules with `iptables` will also have to be carefully applied into that namespace. Considering for this test we are using all interfaces in the dataplane, this point is moot, and we'll take a small shortcut and introduce the `tap` devices in the default namespace.
In the configuration file, I added a `startup-config` (also known as `exec`) stanza. This is a set of VPP CLI commands that will be executed every time the process starts. It's a great way to get the VPP plumbing done ahead of time. I figured, if I let VPP take the network cards, but then re-present `tap` interfaces with names which have the same name that the Linux kernel driver would've given them, the rest of the machine will mostly just work.
So the final trick is to disable every interface in `/etc/network/interfaces` on dcg-1 and then configure it with a combination of a `/etc/vpp/vpp-exec.conf` and a small shell script that puts the IP addresses and things back just the way Debian would've put them using the `/etc/network/interfaces` file. Here we go!
```
# Loopback interface
create loopback interface instance 0
lcp create loop0 host-if lo0
# Core: dcg-2
lcp create GigabitEthernet6/0/0 host-if eno1
# Infra: Not used.
lcp create GigabitEthernet7/0/0 host-if eno2
# LACP to Arista core switch
create bond mode lacp id 0
set interface state TenGigabitEthernet1/0/0 up
set interface mtu packet 1500 TenGigabitEthernet1/0/0
set interface state TenGigabitEthernet1/0/1 up
set interface mtu packet 1500 TenGigabitEthernet1/0/1
bond add BondEthernet0 TenGigabitEthernet1/0/0
bond add BondEthernet0 TenGigabitEthernet1/0/1
set interface mtu packet 1500 BondEthernet0
lcp create BondEthernet0 host-if bond0
# VLANs on bond0
create sub-interfaces BondEthernet0 100
lcp create BondEthernet0.100 host-if bond0.100
create sub-interfaces BondEthernet0 101
lcp create BondEthernet0.101 host-if bond0.101
create sub-interfaces BondEthernet0 102
lcp create BondEthernet0.102 host-if bond0.102
create sub-interfaces BondEthernet0 120
lcp create BondEthernet0.120 host-if bond0.120
create sub-interfaces BondEthernet0 201
lcp create BondEthernet0.201 host-if bond0.201
create sub-interfaces BondEthernet0 202
lcp create BondEthernet0.202 host-if bond0.202
create sub-interfaces BondEthernet0 205
lcp create BondEthernet0.205 host-if bond0.205
create sub-interfaces BondEthernet0 206
lcp create BondEthernet0.206 host-if bond0.206
create sub-interfaces BondEthernet0 2481
lcp create BondEthernet0.2481 host-if bond0.2481
# NLIX
lcp create TenGigabitEthernet1/0/2 host-if enp1s0f2
create sub-interfaces TenGigabitEthernet1/0/2 7
lcp create TenGigabitEthernet1/0/2.7 host-if enp1s0f2.7
create sub-interfaces TenGigabitEthernet1/0/2 26
lcp create TenGigabitEthernet1/0/2.26 host-if enp1s0f2.26
# Fusix Networks
lcp create TenGigabitEthernet1/0/3 host-if enp1s0f3
create sub-interfaces TenGigabitEthernet1/0/3 108
lcp create TenGigabitEthernet1/0/3.108 host-if enp1s0f3.108
create sub-interfaces TenGigabitEthernet1/0/3 110
lcp create TenGigabitEthernet1/0/3.110 host-if enp1s0f3.110
create sub-interfaces TenGigabitEthernet1/0/3 300
lcp create TenGigabitEthernet1/0/3.300 host-if enp1s0f3.300
```
And then to set up the IP address information, a small shell script:
```
ip link set lo0 up mtu 16384
ip addr add 94.142.247.1/32 dev lo0
ip addr add 2a02:898:0:300::1/128 dev lo0
ip link set eno1 up mtu 1500
ip addr add 94.142.247.224/31 dev eno1
ip addr add 2a02:898:0:301::12/127 dev eno1
ip link set eno2 down
ip link set bond0 up mtu 1500
ip link set bond0.100 up mtu 1500
ip addr add 94.142.244.252/24 dev bond0.100
ip addr add 2a02:898::d1/64 dev bond0.100
ip link set bond0.101 up mtu 1500
ip addr add 172.28.0.252/24 dev bond0.101
ip link set bond0.102 up mtu 1500
ip addr add 94.142.247.44/29 dev bond0.102
ip addr add 2a02:898:0:e::d1/64 dev bond0.102
ip link set bond0.120 up mtu 1500
ip addr add 94.142.247.236/31 dev bond0.120
ip addr add 2a02:898:0:301::6/127 dev bond0.120
ip link set bond0.201 up mtu 1500
ip addr add 94.142.246.252/24 dev bond0.201
ip addr add 2a02:898:62:f6::fffd/64 dev bond0.201
ip link set bond0.202 up mtu 1500
ip addr add 94.142.242.140/28 dev bond0.202
ip addr add 2a02:898:100::d1/64 dev bond0.202
ip link set bond0.205 up mtu 1500
ip addr add 94.142.242.98/27 dev bond0.205
ip addr add 2a02:898:17::fffe/64 dev bond0.205
ip link set bond0.206 up mtu 1500
ip addr add 185.52.224.92/28 dev bond0.206
ip addr add 2a02:898:90:1::2/125 dev bond0.206
ip link set bond0.2481 up mtu 1500
ip addr add 94.142.247.82/29 dev bond0.2481
ip addr add 2a02:898:0:f::2/64 dev bond0.2481
ip link set enp1s0f2 up mtu 1500
ip link set enp1s0f2.7 up mtu 1500
ip addr add 193.239.117.111/22 dev enp1s0f2.7
ip addr add 2001:7f8:13::a500:8283:1/64 dev enp1s0f2.7
ip link set enp1s0f2.26 up mtu 1500
ip addr add 213.207.10.53/26 dev enp1s0f2.26
ip addr add 2a02:10:3::a500:8283:1/64 dev enp1s0f2.26
ip link set enp1s0f3 up mtu 1500
ip link set enp1s0f3.108 up mtu 1500
ip addr add 94.142.247.243/31 dev enp1s0f3.108
ip addr add 2a02:898:0:301::15/127 dev enp1s0f3.108
ip link set enp1s0f3.110 up mtu 1500
ip addr add 37.139.140.23/31 dev enp1s0f3.110
ip addr add 2a00:a7c0:e20b:110::2/126 dev enp1s0f3.110
ip link set enp1s0f3.300 up mtu 1500
ip addr add 185.1.94.15/24 dev enp1s0f3.300
ip addr add 2001:7f8:b6::205b:1/64 dev enp1s0f3.300
```
## 4. Results
And this is where it went horribly wrong. After installing the VPP packages on the dcg-1 machine, running Debian Buster on a Supermicro Super Server/X11SCW-F with BIOS 1.5 dated 10/12/2020, the `vpp` process was unable to bind the PCI devices for the Intel x710 NICs. I tried the following combinations:
* Stock Buster kernel `4.19.0-14-amd64` and Backports kernel `5.10.0-0.bpo.3-amd64`.
* The kernel driver `vfio-pci` and the DKMS for `igb_uio` from Debian package `dpdk-igb-uio-dkms`.
* Intel IOMMU off, on and strict (kernel boot parameter `intel_iommu=on` and `intel_iommu=strict`)
* BIOS setting for Intel VT-d on and off.
Each time, I would start VPP with an explicit `dpdk {}` stanza, and observed the following. With the default `vfio-pci` driver, the VPP process would not start; instead, these log lines would repeat:
```
[ 74.378330] vfio-pci 0000:01:00.0: Masking broken INTx support
[ 74.384328] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
## Repeated for all of the NICs 0000:01:00.[0123]
```
Commenting out the `dpdk { dev 0000:01:00.* }` devices would allow it to start and detect the two i210 NICs, which both worked fine.
With the `igb_uio` driver, VPP would start but not detect the x710 devices at all; it would detect the two i210 NICs, but they would not pass traffic or even link up:
```
[ 139.495061] igb_uio 0000:01:00.0: uio device registered with irq 128
[ 139.522507] DMAR: DRHD: handling fault status reg 2
[ 139.528383] DMAR: [DMA Read] Request device [01:00.0] PASID ffffffff fault addr 138dac000 [fault reason 06] PTE Read access is not set
## Repeated for all 6 NICs
```
I repeated this test of both drivers for all combinations of kernel, IOMMU and BIOS settings for VT-d, with exactly identical results.
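For reference, swapping a port between the two drivers for a test run looked roughly like this -- a sketch using the kernel's sysfs `driver_override` mechanism, with the PCI address taken from the `dpdk {}` stanza (the target module, `vfio-pci` or `igb_uio`, must already be loaded):
```
## Detach the first x710 port from whatever driver currently owns it (i40e, vfio-pci, ...)
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
## Pin this device to igb_uio for the next probe
echo igb_uio > /sys/bus/pci/devices/0000:01:00.0/driver_override
## Ask the PCI core to re-probe the device, which now binds it to igb_uio
echo 0000:01:00.0 > /sys/bus/pci/drivers_probe
```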
### Baseline
In a traceroute from BIT to Coloclue (using Junipers on hops 1-3, Linux kernel routing on hop 4), it's clear that only NL-IX is used on hop 4, which means that only dcg-1 is in the path and no other routers at Coloclue. From hop 4 onwards, one can clearly see high variance: a 49.7ms standard deviation on a **~247.1ms** worst case, even though the end-to-end latency is only 1.6ms and the NL-IX port is not congested.
```
sandy (193.109.122.4) 2021-03-27T22:36:11+0100
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4877 0.3 0.2 0.1 7.8 0.2
2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4877 0.3 0.2 0.2 1.1 0.1
3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4877 0.5 0.3 0.2 9.3 0.7
4. coloclue.the-datacenter-group.nl-ix.net 0.2% 4877 1.8 18.3 1.7 253.5 45.0
5. vpp-test.ams.ipng.ch 0.1% 4877 1.9 23.6 1.6 247.1 49.7
```
On the return path, seen in a traceroute from Coloclue to BIT (using Linux kernel routing on hop 1, Junipers on hops 2-4), it becomes clear that the very first hop (the Linux machine dcg-1) is contributing the high variance: a 49.4ms standard deviation on a **257.9ms** worst case, again on an NL-IX port that was not congested, with easy sailing through BIT's 10Gbit network from there on.
```
vpp-test (185.52.225.1) 2021-03-27T21:36:43+0000
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. bond0-100.dcg-1.router.nl.coloclue.net 0.1% 4839 0.2 12.9 0.1 251.2 38.2
2. bit.bit2.nl-ix.net 0.0% 4839 10.7 22.6 1.4 261.8 48.3
3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4839 1.8 20.9 1.6 263.0 46.9
4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4839 155.7 22.7 1.4 282.6 50.9
5. sandy.ede.ipng.nl 0.0% 4839 1.8 22.9 1.6 257.9 49.4
```
### New Configuration
As I mentioned, I had expected this article to have a different outcome: I would've liked to show off the superior routing performance under VPP of the beacon `185.52.225.1/24`, which is reached from AS12859 (BIT) via NL-IX directly through dcg-1. Alas, since I did not manage to get the Intel x710 NIC to work with VPP, I ultimately rolled back, but kept a few settings (Intel VT-d enabled and IOMMU on, hyperthreading disabled, and Linux kernel 5.10, which ships a much newer version of the `i40e` driver for the NIC).
That combination definitely helped: the latency is now very smooth between BIT and Coloclue, with a mean of 1.7ms, a worst case of 4.3ms, and a standard deviation of only **0.2ms**. That is as good as one could expect:
```
sandy (193.109.122.4) 2021-03-28T16:20:05+0200
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.4 0.1
2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.9 0.1
3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4341 0.4 1.0 0.3 28.3 2.3
4. coloclue.the-datacenter-group.nl-ix.net 0.0% 4341 1.8 1.8 1.7 3.4 0.1
5. vpp-test.ams.ipng.ch 0.0% 4341 1.8 1.7 1.7 4.3 0.2
```
On the return path, seen in a traceroute again from Coloclue to BIT, it becomes clear that dcg-1 is no longer causing jitter or loss, at least not towards NL-IX and AS12859. The latency there is likewise an expected 1.8ms, with a worst case of 3.5ms and a standard deviation of **0.1ms**; in other words, comparable to the BIT --> Coloclue path:
```
vpp-test (185.52.225.1) 2021-03-28T14:20:50+0000
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. bond0-100.dcg-1.router.nl.coloclue.net 0.0% 4303 0.2 0.2 0.1 0.9 0.1
2. bit.bit2.nl-ix.net 0.0% 4303 1.6 2.2 1.4 17.1 2.2
3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4303 1.8 1.7 1.6 6.6 0.4
4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4303 1.6 1.5 1.4 4.2 0.2
5. sandy.ede.ipng.nl 0.0% 4303 1.9 1.8 1.7 3.5 0.1
```
## Appendix
An assorted set of notes -- because I did give it "one last try" and almost managed to get VPP to work on this Coloclue router :)
* Boot kernel 5.10 with `intel_iommu=on iommu=pt`
* Load kernel module `igb_uio` and unload `vfio-pci` before starting VPP
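Concretely, that preparation looked something like the following sketch (the kernel parameters go in `/etc/default/grub` and need `update-grub` plus a reboot to take effect):
```
## /etc/default/grub: enable the Intel IOMMU in passthrough mode
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

## Before starting VPP: swap the userspace I/O driver
modprobe -r vfio-pci
modprobe igb_uio        ## from the dpdk-igb-uio-dkms package
systemctl start vpp
```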
What follows is a bunch of debugging information -- useful perhaps for a future attempt at running VPP at Coloclue.
```
root@dcg-1:/etc/vpp# tail -10 startup.conf
dpdk {
uio-driver igb_uio
dev 0000:06:00.0
dev 0000:07:00.0
dev 0000:01:00.0
dev 0000:01:00.1
dev 0000:01:00.2
dev 0000:01:00.3
}
root@dcg-1:/etc/vpp# lsmod | grep uio
uio_pci_generic 16384 0
igb_uio 20480 5
uio 20480 12 igb_uio,uio_pci_generic
[ 39.211999] igb_uio: loading out-of-tree module taints kernel.
[ 39.218094] igb_uio: module verification failed: signature and/or required key missing - tainting kernel
[ 39.228147] igb_uio: Use MSIX interrupt by default
[ 91.595243] igb 0000:06:00.0: removed PHC on eno1
[ 91.716041] igb_uio 0000:06:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e
[ 91.723683] igb_uio 0000:06:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e
[ 91.733221] igb 0000:07:00.0: removed PHC on eno2
[ 91.856255] igb_uio 0000:07:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e
[ 91.863918] igb_uio 0000:07:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e
[ 91.988718] igb_uio 0000:06:00.0: uio device registered with irq 127
[ 92.039935] igb_uio 0000:07:00.0: uio device registered with irq 128
[ 105.040391] i40e 0000:01:00.0: i40e_ptp_stop: removed PHC on enp1s0f0
[ 105.232452] igb_uio 0000:01:00.0: mapping 1K dma=0x103a64000 host=00000000bc39c074
[ 105.240108] igb_uio 0000:01:00.0: unmapping 1K dma=0x103a64000 host=00000000bc39c074
[ 105.249142] i40e 0000:01:00.1: i40e_ptp_stop: removed PHC on enp1s0f1
[ 105.472489] igb_uio 0000:01:00.1: mapping 1K dma=0x180187000 host=000000003182585c
[ 105.480148] igb_uio 0000:01:00.1: unmapping 1K dma=0x180187000 host=000000003182585c
[ 105.489178] i40e 0000:01:00.2: i40e_ptp_stop: removed PHC on enp1s0f2
[ 105.700497] igb_uio 0000:01:00.2: mapping 1K dma=0x12108a000 host=000000006ccf7ec6
[ 105.708160] igb_uio 0000:01:00.2: unmapping 1K dma=0x12108a000 host=000000006ccf7ec6
[ 105.717272] i40e 0000:01:00.3: i40e_ptp_stop: removed PHC on enp1s0f3
[ 105.916553] igb_uio 0000:01:00.3: mapping 1K dma=0x121132000 host=00000000a0cf9ceb
[ 105.924214] igb_uio 0000:01:00.3: unmapping 1K dma=0x121132000 host=00000000a0cf9ceb
[ 106.051801] igb_uio 0000:01:00.0: uio device registered with irq 127
[ 106.131501] igb_uio 0000:01:00.1: uio device registered with irq 128
[ 106.211155] igb_uio 0000:01:00.2: uio device registered with irq 129
[ 106.288722] igb_uio 0000:01:00.3: uio device registered with irq 130
[ 106.367089] igb_uio 0000:06:00.0: uio device registered with irq 130
[ 106.418175] igb_uio 0000:07:00.0: uio device registered with irq 131
### Note above: Gi6/0/0 and Te1/0/3 both use irq 130.
root@dcg-1:/etc/vpp# vppctl show log | grep dpdk
2021/03/28 15:57:09:184 notice dpdk EAL: Detected 6 lcore(s)
2021/03/28 15:57:09:184 notice dpdk EAL: Detected 1 NUMA nodes
2021/03/28 15:57:09:184 notice dpdk EAL: Selected IOVA mode 'PA'
2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB
2021/03/28 15:57:09:184 notice dpdk EAL: No free hugepages reported in hugepages-1048576kB
2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB
2021/03/28 15:57:09:184 notice dpdk EAL: Probing VFIO support...
2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xa80001000 != 0x7eff80000000) not respected!
2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes
2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec0c61000 != 0x7efb7fe00000) not respected!
2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes
2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec18c2000 != 0x7ef77fc00000) not respected!
2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes
2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec2523000 != 0x7ef37fa00000) not respected!
2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes
2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0
2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.0 (socket 0)
2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0
2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.1 (socket 0)
2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0
2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.2 (socket 0)
2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0
2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.3 (socket 0)
2021/03/28 15:57:09:184 notice dpdk i40e_init_fdir_filter_list(): Failed to allocate memory for fdir filter array!
2021/03/28 15:57:09:184 notice dpdk ethdev initialisation failed
2021/03/28 15:57:09:184 notice dpdk EAL: Requested device 0000:01:00.3 cannot be used
2021/03/28 15:57:09:184 notice dpdk EAL: VFIO support not initialized
2021/03/28 15:57:09:184 notice dpdk EAL: Couldn't map new region for DMA
root@dcg-1:/etc/vpp# vppctl show pci
Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
0000:01:00.0 0 8086:1572 8.0 GT/s x8 igb_uio
0000:01:00.1 0 8086:1572 8.0 GT/s x8 igb_uio
0000:01:00.2 0 8086:1572 8.0 GT/s x8 igb_uio
0000:01:00.3 0 8086:1572 8.0 GT/s x8 igb_uio
0000:06:00.0 0 8086:1533 2.5 GT/s x1 igb_uio
0000:07:00.0 0 8086:1533 2.5 GT/s x1 igb_uio
root@dcg-1:/etc/vpp# ip ro
94.142.242.96/27 dev bond0.205 proto kernel scope link src 94.142.242.98
94.142.242.128/28 dev bond0.202 proto kernel scope link src 94.142.242.140
94.142.244.0/24 dev bond0.100 proto kernel scope link src 94.142.244.252
94.142.246.0/24 dev bond0.201 proto kernel scope link src 94.142.246.252
94.142.247.40/29 dev bond0.102 proto kernel scope link src 94.142.247.44
94.142.247.80/29 dev bond0.2481 proto kernel scope link src 94.142.247.82
94.142.247.224/31 dev eno1 proto kernel scope link src 94.142.247.224
94.142.247.236/31 dev bond0.120 proto kernel scope link src 94.142.247.236
172.28.0.0/24 dev bond0.101 proto kernel scope link src 172.28.0.252
185.52.224.80/28 dev bond0.206 proto kernel scope link src 185.52.224.92
193.239.116.0/22 dev enp1s0f2.7 proto kernel scope link src 193.239.117.111
213.207.10.0/26 dev enp1s0f2.26 proto kernel scope link src 213.207.10.53
root@dcg-1:/etc/vpp# birdc6 show ospf neighbors
BIRD 1.6.6 ready.
ospf1:
Router ID Pri State DTime Interface Router IP
94.142.247.2 1 Full/PtP 00:35 eno1 fe80::ae1f:6bff:feeb:858c
94.142.247.7 128 Full/PtP 00:35 bond0.120 fe80::9ecc:8300:78b2:8b62
root@dcg-1:/etc/vpp# birdc show ospf neighbors
BIRD 1.6.6 ready.
ospf1:
Router ID Pri State DTime Interface Router IP
94.142.247.2 1 Exchange/PtP 00:37 eno1 94.142.247.225
94.142.247.7 128 Exchange/PtP 00:39 bond0.120 94.142.247.237
root@dcg-1:/etc/vpp# vppctl show bond details
BondEthernet0
mode: lacp
load balance: l2
number of active members: 2
TenGigabitEthernet1/0/0
TenGigabitEthernet1/0/1
number of members: 2
TenGigabitEthernet1/0/0
TenGigabitEthernet1/0/1
device instance: 0
interface id: 0
sw_if_index: 6
hw_if_index: 6
root@dcg-1:/etc/vpp# ping 193.239.116.1
PING 193.239.116.1 (193.239.116.1) 56(84) bytes of data.
64 bytes from 193.239.116.1: icmp_seq=1 ttl=64 time=2.24 ms
64 bytes from 193.239.116.1: icmp_seq=2 ttl=64 time=0.571 ms
64 bytes from 193.239.116.1: icmp_seq=3 ttl=64 time=0.625 ms
^C
--- 193.239.116.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 0.571/1.146/2.244/0.777 ms
root@dcg-1:/etc/vpp# ping 94.142.244.85
PING 94.142.244.85 (94.142.244.85) 56(84) bytes of data.
64 bytes from 94.142.244.85: icmp_seq=1 ttl=64 time=0.226 ms
64 bytes from 94.142.244.85: icmp_seq=2 ttl=64 time=0.207 ms
64 bytes from 94.142.244.85: icmp_seq=3 ttl=64 time=0.200 ms
64 bytes from 94.142.244.85: icmp_seq=4 ttl=64 time=0.204 ms
^C
--- 94.142.244.85 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 66ms
rtt min/avg/max/mdev = 0.200/0.209/0.226/0.014 ms
```
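As an aside, the EAL warnings above about `hugepages-1048576kB` are harmless as long as enough 2MB hugepages are reserved. The Debian VPP packages ship a sysctl fragment for this; it looks roughly like the following (the values are the packaged defaults, not tuned for this machine):
```
## /etc/sysctl.d/80-vpp.conf (sketch) -- reserve 2MB hugepages for DPDK/VPP
vm.nr_hugepages=1024            ## 1024 x 2MB = 2GB of hugepage memory
vm.max_map_count=3096           ## at least 2x nr_hugepages, plus headroom
kernel.shmmax=2147483648        ## allow a single 2GB shared memory segment
```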
### Cleaning up
```
apt purge dpdk* vpp*
apt autoremove
rm -rf /etc/vpp
rm /etc/sysctl.d/*vpp*.conf
cp /etc/network/interfaces.2021-03-28 /etc/network/interfaces
cp /root/.ssh/authorized_keys.2021-03-28 /root/.ssh/authorized_keys
systemctl enable bird
systemctl enable bird6
systemctl enable keepalived
reboot
```
### Next steps
Take another look at IOMMU and passthrough mode (see this [Red Hat guide](https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough), in particular the part about `allow_unsafe_interrupts` in the vfio kernel module), and find a way to get the NICs (1x Intel x710 and 2x Intel i210) detected in VPP. By then, the Linux CP (interface mirroring and Netlink listener) will probably have been submitted.
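If the `allow_unsafe_interrupts` route is attempted, the option can be set persistently with a modprobe fragment, sketched below. Note that this deliberately weakens the interrupt remapping safety check, which is exactly what was firing in the `vfio-pci` case above:
```
## /etc/modprobe.d/vfio.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
```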
---
date: "2021-05-17T22:27:34Z"
title: IPng arrives in Frankfurt
---
I've been planning a network expansion for a while now. For the next few weeks,
I will be in total geek-mode as I travel to several European cities to deploy
AS50869 on a european ring. At the same time, my buddy Fred from
[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
it just never clicked with the right set of circumstances for them to deploy
in the Netherlands, until the stars aligned ...
## Leadup to the Roadtrip
Usually, IP-Max deploys their routers by having them shipped into the destination
location, but this time was special. We decided to make a roadtrip out of it,
so Fred made his way from Geneva to Br&uuml;ttisellen, stayed the night, and early
on Monday May 17th, we packed up the car and started our trek.
It turns out we had estimated our risk profile completely wrong - we thought it
would be hard to cross the border into Germany due to the ongoing pandemic, but
actually that part was fine. The Germans had opened their borders for transit
traffic and stays of up to 24hrs just a few days ago, and we both got a
(negative) PCR test so we felt we had our bases covered.
## The Border
Then when we arrived at the border, perhaps because we had Geneva license
plates, we were asked about our trip, business or pleasure, and we shared that
we had some equipment with us. Thus begun the four-and-a-half hour customs
exercise that was necessary for us to safely send our equipment off to
the European Union. One would think it should be easy, but it actually wasn't
quite that easy, considering we arrived at the border at 9am on a Monday, and
the traffic into Switzerland was queueing up all expeditor and logistics
companies, so nobody really was willing to help us out. But we made it and left
again shortly after 1:30pm.
## Frankfurt
{{< image width="300px" float="right" src="/assets/network/defra0-rack.png" alt="IP-Max at Frankfurt" >}}
We arrived at Frankfurt Equinix FR5 at the Kleyerstrasse at around 5pm. The
IP-Max rack was quickly found, and while Fred was installing their corporate
Xen host to run remote VMs for the Frankfurt area, I deployed the first router
of the trip: **defra0.ipng.ch**.
IP-Max at this location has a respectable 30G of DWDM capacity from three
different vendors into Zurich, 30G of LAG capacity towards DE-CIX, and a
10G DWDM wave into Anzin (France), which will be broken up for us in Amsterdam
for a future blogpost - stay tuned :)
Making use of line card and route processor redundancy, we decided to use
three line cards, reserving one TenGig ethernet port on each:
* Te0/0/0/4 -- EoMPLS to NTT/eShelter Rumlang (**chrma0.ipng.ch**)
* Te0/1/0/4 -- EoMPLS to Interxion ZUR1 (**chgtg0.ipng.ch**)
* Te0/2/0/4 -- EoMPLS to Amsterdam NIKHEF (**nlams0.ipng.ch**)
At each site, specifically those that are a bit further away, I deploy a
standard issue [PCEngines APU](https://pcengines.ch/) with 802.11ac WiFi,
serial, and IPMI access to any machine that may be there. If you ever visit
a datacenter floor where I'm present, look for SSID _AS50869 FRA_ in the
case of Kleyerstrasse. The password is _IPngGuest_, you're welcome to some
bits of bandwidth in a pinch :)
You can see my router dangling off what looks like a fiber optic umbilical
cord under **er01.fra05.ip-max.net**, right at the heart of the Frankfurt
internet.
### Logical Configuration
{{< image width="300px" float="left" src="/assets/network/console-fra.png" alt="console-fra.ipng.nl" >}}
**console.fra.ipng.nl** At the top of the rack you can also see the blue APU3
with its WiFi antennas. It takes an IPv4 /29 and IPv6 /64 from IP-Max AS25091
which gives me access to my equipment even if bad things happen (and they will,
it's just a matter of time!). It also exposes a WireGuard endpoint so that I can
reach it without needing SSH, which comes in useful if a KVM console is
required. Note the logo :-)
On the inside of the APU, it configures one RFC1918 wifi segment and another
RFC1918 wired segment. In this case, the wired segment is connected to the
IPMI port of the Supermicro router. I have really gotten used to this style
of deployment -- I **start** with the OOB. Once the APU has power (and it does
not need to have an uplink yet), I can already SSH to it from the wireless
segment, and further configure it. Once it's done, I make a habit of rebooting
it to ensure it comes up. Then, I can easily configure (and even entirely
install!!) the server behind it using IPMI serial-over-lan and HTML5 KVM
if need be. It's delicious. And, it has saved my ass several times over the
years!
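As an illustration of that workflow: from the APU's wired RFC1918 segment, the server behind it can be driven over IPMI roughly like this (a sketch; the address and credentials are placeholders, not the real ones):
```
## Serial-over-LAN console to the Supermicro behind the APU
ipmitool -I lanplus -H 172.16.0.2 -U ADMIN -P <password> sol activate
## Power-cycle it remotely if it wedges during install
ipmitool -I lanplus -H 172.16.0.2 -U ADMIN -P <password> chassis power cycle
```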
{{< image width="300px" float="left" src="/assets/network/defra0.png" alt="defra0.ipng.ch" >}}
**defra0.ipng.ch** Making use of the line card redundancy, there is now 3x
10Gig connected to my router, which immediately makes it one of the better
connected hosts in this facility. Logging in via IPMI, the [DANOS](https://danosproject.org)
image is quickly configured. There's one link to Interxion ZUR1 in Glattbrugg,
one link to eShelter in R&uuml;mlang, and one link up to Amsterdam. The
interface towards Interxion ZUR1 doubles up as an egress point for now. There
will be an IPv4/IPv6 transit session with AS25091, a [DE-CIX](https://de-cix.net)
connection, and possibly (but probably not) a [Kleyrex](https://kleyrex.net)
connection, given the murderous cross connect costs at this facility.
## The results
{{< image width="100px" float="right" src="/assets/network/iperf-chgtg0-defra0.png" alt="iperf" >}}
After the OSPF and OSPFv3 adjacencies came up, iBGP was next. For now, the
machine is single-homed off of **chrma0.ipng.ch** but soon there will be as
well a leg towards Amsterdam. So for now, all that we can do is test basic
connectivity. So after finishing our trip to Amsterdam, and checking into
our AirBnB ready to go through our quarantine song-and-dance, we spent a
little time celebrating - we arrived at 1:30am, and turned in for the night
at 3am. The next day our groceries arrived; unfortunately I had wanted to be
"well prepared" and had ordered them to be delivered between 7 and 8am on Tuesday.
After a full day of _regular work_, we spent the evening taking a look at
how my kit performs, and we are happy to report it's absolutely great:
```
pim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10
...
[SUM] 0.00-10.00 sec 11.2 GBytes 9.63 Gbits/sec 281 sender
[SUM] 0.00-10.02 sec 11.2 GBytes 9.56 Gbits/sec receiver
pim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10 -R
...
[SUM] 0.00-10.01 sec 10.2 GBytes 8.73 Gbits/sec 550 sender
[SUM] 0.00-10.00 sec 10.1 GBytes 8.70 Gbits/sec receiver
pim@defra0:~$ ping4 chrma0.ipng.ch
PING chrma0.ipng.ch (194.1.163.0) 56(84) bytes of data.
...
--- chrma0.ipng.ch ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 20ms
rtt min/avg/max/mdev = 5.864/6.022/6.173/0.072 ms
```
The roundtrip latency to Zurich is about 6.0ms, and the throughput is around
9Gbit in both directions for my router. Soon, we will go to Amsterdam and
deploy router number two (of four!) on this epic roadtrip: **nlams0.ipng.ch**
which is a bucket list item of mine -- to peer at Amsterdam Science Park.
More on that later!
---
date: "2021-05-26T21:19:34Z"
title: IPng arrives in Amsterdam
---
I've been planning a network expansion for a while now. For the next few weeks,
I will be in total geek-mode as I travel to several European cities to deploy
AS50869 on a european ring. At the same time, my buddy Fred from
[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
it just never clicked with the right set of circumstances for them to deploy
in the Netherlands, until the stars aligned ...
## Leadup for IP-Max
Usually, if I were to go deploy somewhere with IP-Max, I settle down on top of
(or underneath, or in some way physically close to) their router at a point of
presence of theirs. In Amsterdam though, it was different ... because IP-Max
had not yet _built_ a PoP here.
But I ask: why would that stop us? Fred told me last year that he had always
wanted to build out a PoP in Amsterdam, but somehow he never really found the
time. I offered to do the work to organize the local supplier chain, get a
good spot in a well connected place, long haul to France and Germany, and
otherwise exercise my (social) network to get it done.
In March 2021, I stumbled across rackspace at [NIKHEF](https://barm.nikhef.nl/housing/)
by working with the folks from [ERITAP](https://eritap.com/) who got their hands
on something that is less of a commodity: a full rack + power (the facility is
always chronically oversubscribed).
A few chores on the tasklist:
1. Sign for rackspace. Check.
1. Order the IP-Max standard-issue _small pop_ kit, which consists of:
* One Cisco [ASR9001](https://www.cisco.com/c/en/us/products/collateral/routers/asr-9001-router/data_sheet_c78-685687.html)
* One Nexus [3064PQ](https://www.cisco.com/c/en/us/support/switches/nexus-3064-switch/model.html)
* One PCEngines [APU4](https://www.pcengines.ch/apu4c4.htm) for out-of-band
* And all the power/copper/fiber cables, optics, serial dongles we might need
1. Procure an out-of-band provider for our APU4, easily found at NIKHEF (thanks, [Arend](https://eritap.com/))!
1. Get connectivity in and out of Amsterdam!
### Connectivity
The most important piece of planning is around the long haul connectivity.
Considering IP-Max already operates a circuit from Frankfurt (Germany) to
Anzin (France), I arranged for that link to be rerouted through Amsterdam
and broken into two segments: Frankfurt-Amsterdam and Amsterdam-Anzin. I
was like a kid in a candy store being able to meticulously choose the route
that the fiber takes -- over D&uuml;sseldorf, entering the Netherlands at
Emmerik, over Arnhem and Ede, and to Amsterdam. A very direct route, using
a 10Gig DWDM wave.
The other span goes from Amsterdam through Antwerp (Belgium) and Brussels
and finally landing in Anzin (near Lille, France), which was the previous
10Gig DWDM wave, so there is no increased latency even though the link is
broken up in Amsterdam. Yaay!
Delivery of the DWDM waves was ordered on March 30th, and although it should
normally take 25 working days to deliver, for some awkward reason with the
supplier it was going to take way longer than what we could afford, so a
spot of VP-style escalation took place, and oh look! Now it would take only four
weeks to turn up. Delivery was completed last Friday, just in time
for our trip. Double yaay!
## Staging Amsterdam
{{< image width="300px" float="right" src="/assets/network/nlams0-staging.png" alt="Staging Amsterdam" >}}
Because this is a completely new site for [IP-Max](https://ip-max.net) as well
as [IPng](https://ipng.ch/), we'll have to do a bit more work. And this suits
us just fine, because after driving through Frankfurt (see my [previous post]({% post_url 2021-05-17-frankfurt %})),
to the Netherlands, we have to stay in quarantine for five days (or, ten if we
happen to fail our PCR test after five days!), which gives us plenty of time
to stage and configure what will be our Cisco **er01.ams01.ip-max.net** and
our Nexus **as01.ams01.ip-max.net**.
Of course, figuring out how all of this fits together is a nice exercise, and
we planned to just _plug and play_ the ASR9k, so it had to be completely
configured ahead of time (which worked out rather successfully, by the way).
We created the interfaces, DNS, routing protocols like OSPF, OSPFv3, MPLS/LDP,
BGP and all of the good stuff like ACLs, accounts and et cetera.
We staged the stuff in the laundry room of our AirBnB, being actually quite
grateful once the staging was complete and we could turn the machines off.
For IPng, staging **nlams0.ipng.ch** was already done ahead of time. So all
I really needed for it, was to ensure that the EoMPLS circuits were created
ahead of time. I was really looking forward to seeing if we could beat 14ms
to Amsterdam on the [IP-Max](https://ip-max.net/) network.
## Extracurriculars
{{< image width="400px" float="right" src="/assets/network/airbnb-staging.png" alt="Dell AirBNB" >}}
Besides the staging, we also ate some pretty delicious food:
* Mushroom risotto
* HotPot with Arend and Esther
* Chicken vegetable soup
* Tacos w/ Tapas
* Steak w/ broccoli and potatoes
* Red tuna w/ beans and herbs
But we also took the time to explore a little bit, for example on Kaz's new
boat through the canals and over the river Amstel. But mostly: we sat home and
enjoyed our quarantine the best we could :-)
## Deployment (day 1)
{{< image width="400px" float="right" src="/assets/network/nlams0-staging-day1.png" alt="IP-Max Staging" >}}
First before the day started, I drained the Frankfurt-Anzin link by raising
OSPF cost on **er01.fra01.ip-max.net** and **er01.lil02.ip-max.net** while Fred
notified customers and the IP-Max team of the impending update to the network.
We met up with [ERITAP](https://eritap.com/) on Monday the 24th, our target deploy
date. We had labeled and packed up all of our gear, grabbed the car, and made
our way to the Watergraafsmeer to the place where the Internet landed in Europe
in 1982. Almost 40 years later, here we are: IP-Max is moving in!
The physical work was not very exciting. The Nexus, ASR, two APUs and my own
Supermicro were racked in only a few minutes. But then the interesting bits
begin -- how do we connect all of this without making a _Kabelsalat_ that you
so often see in people's racks.
{{< image width="400px" float="right" src="/assets/network/nlams0.png" alt="nlams0" >}}
But yet at the same time, both Fred and I were enthusiastic and couldn't wait
to see the ping time to Anzin and Frankfurt from here. I left Fred the honors
to connect his own brand new **er01.ams01.ip-max.net** by opening the patched
through loop from our supplier, and he was beaming once he saw OSPF and OSPFv3
adjacencies and a latency of just short of 6ms. But he was very kind to let
me do the second honors to connect the router to Anzin, at just over 5ms. That
is a really fantastic performance and very short path indeed. This will be fun
for my next adventure, I'm sure. We'll see the Dell pictured above appear as
**frlil0.ipng.ch** but I get ahead of myself ..
After we connected the whole thing up and did extensive ping tests, we
undrained the spans and saw a respectable 600Mbit of traffic traverse the
new router. Because there were a few other folks tinkering in the rack (for
example our friends from [Coloclue](https://coloclue.net/)) we decided to
adjourn for the day and visit Paul and Henrieke up in Almere for a fabulous
homecooked meal (thanks again for the Pica&ntilde;a!) and we enjoyed being
followed by the cops when driving back out of Almere -- but we were not
bothered/hassled by them.
## Deployment (day 2)
{{< image width="400px" float="right" src="/assets/network/nlams0-pim-sad.png" alt="Pim Cries" >}}
But then (and this is technically day 2 because it was, let's just say, well
after midnight), as the IP-Max network calmed down for the night I did my
stress test and got a horrible surprise: interface errors! They were
Frame Checksum Errors and while the performance from **defra0.ipng.ch** to
**nlams0.ipng.ch** was impeccable (9.2Gbit, yaay), the transfer speed in
the reverse direction stalled out at about 35Mbit. That is **NOT** what
the Doctor ordered!
So luckily we had already decided to go back for a day2 to complete the
rack install, mostly for things like the fiber patch panel for IP-Max
customers in the ERITAP rack, and to ensure that our power, serial and
network cables would not come loose, because packets don't like loose
cables. Certainly we should avoid the electrons or photons falling onto
the floor...
But the weird thing about my link errors (as seen by the ASR9k) was that
usually the problem is either a duplex error (which was OK), or a dirty
fiber or transceiver (which was unlikely considering this link was an
SFP+ DAC!). So that leaves either a faulty Cisco or a faulty Supermicro,
neither of which are appealing.
On day two, after breakfast, we had to do a few chores first (like the claim
for the VAT for imports, see our [previous post]({% post_url 2021-05-17-frankfurt %}),
and as well get a corona PCR test for the way to France (which was absolutely
horrible, by the way, I *still* feel my nose which was violated). So we hit
NIKHEF at around 4pm to finish the job and take care of a few small favors for
Coloclue, ERITAP and Byteworks, who are also in the same rack as IPng and
IP-Max.
## The results
After I replaced the DAC (ironically with an SFP+ optic), once OSPF and iBGP
came back to life, this is what it looked like:
```
pim@chumbucket:~$ traceroute nlams0.ipng.ch
traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.292 ms 0.216 ms 0.179 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.599 ms 0.565 ms 0.531 ms
3 chrma0.ipng.ch (194.1.163.8) 0.873 ms 0.840 ms 0.806 ms
4 defra0.ipng.ch (194.1.163.25) 6.783 ms 6.751 ms 6.718 ms
5 nlams0.ipng.ch (194.1.163.32) 12.864 ms 12.831 ms 12.798 ms
pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch
...
[SUM] 0.00-10.00 sec 11.0 GBytes 9.49 Gbits/sec 95 sender
[SUM] 0.00-10.01 sec 11.0 GBytes 9.41 Gbits/sec receiver
pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch -R
...
[SUM] 0.00-10.01 sec 10.0 GBytes 8.62 Gbits/sec 339 sender
[SUM] 0.00-10.00 sec 9.98 GBytes 8.57 Gbits/sec receiver
```
{{< image width="400px" float="right" src="/assets/network/nlams0-pim-happy.png" alt="Pim Laughs" >}}
That will do, thanks. I cannot believe that the latency from my basement
workstation in Br&uuml;ttisellen, Switzerland, to the local internet
exchange is 0.8ms, then through to Frankfurt at 6.2ms and then all the way
to Amsterdam the end to end round trip latency is 12.2ms. I can stare at
the smokeping for hours!!
So I spent the remainder of the night hanging out with Fred while pumping
9Gbit in both directions for 2 hours while traffic was low. It's one thing
to do an `iperf` in your basement rack, but it's an entirely different feeling
to do an `iperf` spanning three countries in Europe (CH, DE and NL). I will
note that the spans from Zurich to Frankfurt didn't even get warm, although
the one from Frankfurt to Amsterdam kind of broke a sweat for a little while
there ...
And the coolest thing yet? We're not done with this trip.
---
date: "2021-05-28T22:16:44Z"
title: IPng arrives in Lille
---
I've been planning a network expansion for a while now. For the next few weeks,
I will be in total geek-mode as I travel to several European cities to deploy
AS50869 on a european ring. At the same time, my buddy Fred from
[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
it just never clicked with the right set of circumstances for them to deploy
in the Netherlands, until the stars aligned ...
## Deployment
{{< image width="300px" float="right" src="/assets/network/lille-civ1.png" alt="Lille CIV1" >}}
After our adventure in [Amsterdam]({% post_url 2021-05-26-amsterdam %}), and
after Fred and I both got negative PCR test results, we made our way down to
Lille, France. There are two datacenters there where IP-Max has a presence,
and they are very innovative ones. There's a specific trick with a block of
frozen ice that allows for the facility cooling to run autonomously in case
of a chiller or power failure. I got to see the storage of the icecube :)
Now, because Fred is possibly even more enthusiastic about our eurotrip than
I am, he insisted that I also put a machine here in Lille, even though
originally it was meant to be only Frankfurt, Amsterdam, Paris and Zurich.
After some tough negotiations, I reluctantly agreed. While I did not have a
Supermicro, I did have some freshly procured Dell R610s, the old machines
from Coloclue AS8283, which they have recently upgraded to newer routers.
I installed the pair at EUNetworks for Coloclue, and another team mate
installed the pair at DCG/NorthC, after which four of these units were left,
written off and available. And I would not be me if I did not accept
some perfectly valuable second-hand vintage Dells :-)
So the plan is: we drop an R610 here now, and I ship a replacement _standard
issue_ Supermicro + APU3d3/WiFi later.
### Connectivity
{{< image width="300px" float="right" src="/assets/network/frggh0-rack.png" alt="Lille Rack" >}}
By now I'm getting pretty good at creating L2VPN EoMPLS circuits, so I
created one for myself from the Amsterdam router **er01.ams01.ip-max.net**
to the one here at **er01.lil01.ip-max.net**. The one here is a Cisco
ASR9006, a respectable machine. In the second point of presence, in
the neighboring town of Anzin, there is an ASR9010 called **er01.lil02.ip-max.net**
but that one is tucked away in a telco room reserved for deities, not
in the normal serverroom which is available for _plebs_ like me.
The inbound span goes from Amsterdam through Antwerp (Belgium) and Brussels
and finally landing in Anzin, and from there it's dark fiber to this rack.
I realised that, because I use [UN/LOCODE](https://unece.org/trade/cefact/unlocode-code-list-country-and-territory),
I should be precise in my naming. The town here is a suburb of Lille,
the country's fourth biggest city after Paris, Marseille and Lyon. The
town itself is called Sainghin-en-M&eacute;lantois, which resolves to
**FR GGH** and thus my temporary Dell R610 will be called **frggh0.ipng.ch**.
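That naming convention can be sketched in a few lines -- note that this
helper is my own illustration of the scheme, not an official tool:

```python
# Sketch of the IPng router naming convention: lowercased UN/LOCODE
# (country code + location code), an index, and the domain.
def router_name(unlocode: str, index: int = 0, domain: str = "ipng.ch") -> str:
    country, location = unlocode.split()  # e.g. "FR GGH"
    return f"{country.lower()}{location.lower()}{index}.{domain}"

print(router_name("FR GGH"))  # frggh0.ipng.ch
print(router_name("NL AMS"))  # nlams0.ipng.ch
```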
Fred happened to have a spare Intel X552 in his bag, so I commandeered it and
gave the machine two legs of Tengig, one going to the ASR9006 and the other
going to the Nexus 3064PQ under it. Soon, we will connect here to the local
internet exchange [LillIX](https://www.lillix.fr/membres/). There are very
few non-local, let alone international, members at LillIX, but considering
that larger clubs like Zayo require two or more French peering points, this
will be my ticket to some pretty good peers. Nice!
The idea is that one VLL lands me on Amsterdam, and the other will eventually
land me on Paris at Telehouse TH2. But that will be after the weekend, as
first we need to spend some quality time exploring Lille and recreating the
_grignotage_ (English: nibbling) that must be done in the north of France.
## The results
As always on the IP-Max network, they speak for themselves. During daytime,
the connectivity from my basement to Frankfurt is at 6.7ms, to Amsterdam
it's 13ms and all the way to Lille it's 20.3ms, with throughputs that
are, let's just say, _line rate_ **booyah**!
```
pim@chumbucket:~$ traceroute frggh0.ipng.ch
traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.246 ms 0.214 ms 0.135 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.513 ms 0.478 ms 0.445 ms
3 chrma0.ipng.ch (194.1.163.8) 0.633 ms 0.599 ms 0.658 ms
4 defra0.ipng.ch (194.1.163.25) 6.808 ms 6.773 ms 6.740 ms
5 nlams0.ipng.ch (194.1.163.27) 13.090 ms 13.057 ms 13.024 ms
6 frggh0.ipng.ch (194.1.163.34) 20.370 ms 20.550 ms 20.473 ms
pim@frggh0:~$ iperf3 -c chrma0.ipng.ch -P 10 -R
...
[SUM] 0.00-10.02 sec 8.84 GBytes 7.58 Gbits/sec 271 sender
[SUM] 0.00-10.00 sec 8.75 GBytes 7.52 Gbits/sec receiver
pim@defra0:~$ iperf3 -P 10 -c frggh0.ipng.ch -R
...
[SUM] 0.00-10.02 sec 11.1 GBytes 9.54 Gbits/sec 292 sender
[SUM] 0.00-10.00 sec 11.1 GBytes 9.51 Gbits/sec receiver
```
After the weekend, we'll be driving on to Paris to complete the ring, after
which I will have two different ways to traverse -- clockwise from
Zurich to Geneva, Paris, Lille, Amsterdam and Frankfurt, or counterclockwise
on the same ring. There will be 10Gbit between each of my routers in each
direction. We do not compromise on quality, throughput or latency over here.
I could not be happier with the service provided so far. Paris, here we come!!
---
date: "2021-06-01T18:16:42Z"
title: IPng arrives in Paris
---
I've been planning a network expansion for a while now. For the next few weeks,
I will be in total geek-mode as I travel to several European cities to deploy
AS50869 on a european ring. At the same time, my buddy Fred from
[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
it just never clicked with the right set of circumstances for them to deploy
in the Netherlands, until the stars aligned ...
## First a word
I just wanted to start with a note on how special it is to partner with [IP-Max](https://www.ip-max.net/);
having known the founder Fred for so long is what affords me this specific trip.
It must be said that building a 10G pan-european ring is not an easy thing to
do, and I very much appreciate the kindness Fred has shown IPng even though
we are under contract -- the ability to travel to every point of presence and
get the _founder's tour_ in each place, to get to know the local DC ops, sales
directors, field technicians and local customers, is simply golden. Thank you.
## Deployment
{{< image width="300px" float="right" src="/assets/network/frpar0-rack.png" alt="Paris KDDI" >}}
The city of love, corny as it sounds, also happens to have quite a bit of _fibre
noire_, happily lit by hundreds of local and international carriers. There are
two places in the dead-center of the city, a _cabaret voltaire_ called Telehouse
TH2, and a new facility from [KDDI](https://fra.kddi.com/) which is in the same
physical block, aptly addressed as 65 Rue L&eacute;on Frot, 75011 Paris and that
is where my beloved router **frpar0.ipng.ch** will be. Note that in Lille
I actually had to make do with **frggh0**, which sat in the town of
Sainghin-en-M&eacute;lantois. This one lives in Paris, none of that suburban
bullshit. This is the real deal. There's probably more connectivity in this
one block than in all of the Paris metro combined, maybe even more than
all of France combined -- peut &ecirc;tre :)
After having visited the older location, we took the router, APU, cables and
optics to L&eacute;on Frot. The rack was quickly found, and it is obvious that
this is the location: A _fridge_ was awaiting us, and I reserved two Tengig
ports on the ASR9010, one towards Lille and one towards Z&uuml;rich (which will be
Geneva later on).
I'm getting the hang of this VLL stuff after our adventures, previously in
[Lille]({% post_url 2021-05-28-lille %}) at CIV, which is a lazy 4.8ms away
from this place (Fred already speaks of making that more like 3.7ms with a
_small call_ to his buddy Laurent). So I went about my business, racking first
the WiFi enabled **console.par.ipng.nl**, connecting to WiFi with it, and
finding my router on **frpar0.ipng.ch** over IPMI serial-over-lan. Configuring
one Tengig port on the Intel X710 towards **er01.lil01.ip-max.net** and another
Tengig port towards **er01.zrh56.ip-max.net**.
I have two Supermicros on backorder, one of which will go to Lille to replace
the Dell R610 that I placed temporarily in that site, and the other will go to
Geneva. At that point, I will break up this VLL to become one from here to
Geneva and another from Geneva back home to Z&uuml;rich. Seriously though I will
have to stop Fred's enthusiasm because he also mentioned something about
[LyonIX](https://www.rezopole.net/fr/ixp/lyonix) and another small stopover of
1HE and 35W over there... this is addictive, save me from myself!!1
## Connectivity on the FLAP
All week, Fred has been talking about _the FLAP_, which I know to be a term
but I really never bothered to ask about it -- it turns out, he explained,
that it stands for **F**rankfurt, **L**ondon, **A**msterdam, and **P**aris.
I'm not in London (yet, please don't dare me ...), however I can legitimately
claim I am on _the FLAP_ because I have a router in **L**ille. So there's that :)
{{< image float="left" src="/assets/network/frpar0.png" alt="frpar0" >}}
Fred has ordered my FranceIX connection this afternoon, delivered from a
20Gig LAG on **er02.par02.ip-max.net** and directly into my router there. In
the mean time, I will be busy configuring my DE-CIX port from [a previous post]({% post_url 2021-05-17-frankfurt %}).
The console server here (a standard issue APU3 with 802.11ac WiFi broadcasting
_AS50869 PAR_ with password _IPngGuest_, you're welcome), connects to the
router with IPMI, while the router itself connects via USB serial back to the
APU for maximum resilience.
## A hard knock life
{{< image width="300px" float="right" src="/assets/network/happy-fred.png" alt="happy-fred" >}}
It was not without troubles today. When configuring my VLL to Zurich, I had
misconfigured the **er01.zrh56.ip-max.net** side (a Cisco 7600, which is on its
way out), and the VLL would not come up. I could see traffic going in one
direction but not in the other ... which typically does not make OSPF adjacencies
happen. After about an hour of messing around, I puppydog-eyed to Fred who
proceeded to find my bug within 30 seconds: I needed to do some VLAN gymnastics
by adding `rewrite ingress tag pop 1 symmetric`, as well as adding 4 bytes
to the MTU (so `mtu 9018` total, cuz packets gotta be sourced directly from the
[jumbo-jumbo club](https://www.rheinfelderbierhalle.com/)!).
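For flavour, the working end of such an attachment circuit looks roughly
like this in IOS-XR syntax -- the interface name, VLAN, pw-id and neighbor
address below are made up for illustration:

```
interface TenGigE0/0/0/1.100 l2transport
 encapsulation dot1q 100
 rewrite ingress tag pop 1 symmetric
 mtu 9018
!
l2vpn
 xconnect group IPNG
  p2p PAR-ZRH
   interface TenGigE0/0/0/1.100
   neighbor 192.0.2.1 pw-id 100
```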
But, Fred was miserable as well because he had updated a Xen hypervisor which
ended up not being able to boot because of a broken LVM configuration. So we
literally swapped laptops and while he fixed my VLL, I fixed his volume group
by running `update-initramfs -u -k all` from a recovery Debian USB stick. For
an extra bonus, here's a picture of Happy Fred at the local Cafe Leopard, where
pretty much every day you can find a host of locals and international nerds who
have emerged from the server floor.
And Mael also helped with the serial port - I had put it into `115200` baud
but not pinned it at `8N1` for the APU serial console. It shot into life
as soon as he gave that tip and I committed the config, bravo!
I tend to believe it will not be necessary for me to physically visit the
facility that often -- simple hardware, no spinning disks, an APU connecting to
IPMI for full HTML5 based KVM control and serial-over-lan, and the router
exposing a console back to the APU, which has an OOB network connection from
[AS25091](https://as25091.peeringdb.com/). Yeah, I think I'll be good.
## The results
But this was a special one indeed, because up until now, my traceroutes kept
on getting longer and longer as I deployed in Frankfurt, Amsterdam, and Lille.
Deploying in Paris therefore looked like this initially, as the packets took,
let us say, the scenic route to my basement:
```
pim@frpar0:~$ traceroute chumbucket.ipng.nl
traceroute to chumbucket.ipng.nl (194.1.163.93), 30 hops max, 60 byte packets
1 frggh0.ipng.ch (194.1.163.30) 4.915 ms 4.885 ms 4.866 ms
2 nlams0.ipng.ch (194.1.163.28) 12.396 ms 12.398 ms 12.382 ms
3 defra0.ipng.ch (194.1.163.26) 18.536 ms 18.520 ms 18.541 ms
4 chrma0.ipng.ch (194.1.163.24) 24.572 ms 24.557 ms 24.542 ms
5 chgtg0.ipng.ch (194.1.163.9) 24.549 ms 24.510 ms 24.517 ms
6 chbtl1.ipng.ch (194.1.163.18) 24.707 ms 25.114 ms 25.038 ms
7 chumbucket.ipng.nl (194.1.163.93) 25.320 ms 25.564 ms 25.452 ms
```
That's quite the scenic route indeed. But! On this glorious day, at exactly
16:34 UTC, the Tengig european IPv4 and IPv6 ring was closed, with one final set
of OSPF adjacencies:
```
pim@frpar0:~$ show protocols ospfv3 neighbor
Neighbor ID Pri DeadTime State/IfState Duration I/F[State]
194.1.163.34 1 00:00:37 Full/PointToPoint 01:40:51 dp0p6s0f0.100[PointToPoint]
194.1.163.1 1 00:00:33 Full/PointToPoint 00:01:28 dp0p6s0f1.100[PointToPoint]
```
This allowed the ring to home in on shortest paths - east bound to
Frankfurt and Amsterdam, and west bound to Paris and Lille. Link and equipment
failures will not bother me much: OSPF and OSPFv3 will take care of
rerouting me around network problems, which, considering the ASR9k gear at
IP-Max, I expect will be the exception:
```
pim@chumbucket:~$ traceroute frggh0.ipng.ch
traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.317 ms 0.238 ms 0.190 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.619 ms 0.574 ms 0.531 ms
3 frpar0.ipng.ch (194.1.163.40) 15.271 ms 15.226 ms 15.174 ms
4 frggh0.ipng.ch (194.1.163.34) 20.059 ms 20.020 ms 19.977 ms
pim@chumbucket:~$ traceroute nlams0.ipng.ch
traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.345 ms 0.198 ms 0.284 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.610 ms 0.518 ms 0.538 ms
3 chrma0.ipng.ch (194.1.163.8) 0.732 ms 0.750 ms 0.716 ms
4 defra0.ipng.ch (194.1.163.25) 6.835 ms 6.802 ms 6.767 ms
5 nlams0.ipng.ch (194.1.163.32) 12.799 ms 12.765 ms 12.731 ms
```
Of course with impeccable throughput, bien s&ucirc;r:
<iframe width="300" height="400" src="https://www.youtube.com/embed/9dLzGvXkqMI?autoplay=1&mute=0" style="width:300px; float: right; margin-left: 1em; margin-bottom: 1em;">
</iframe>
```
pim@frpar0:~$ iperf3 -c frggh0.ipng.ch -R
...
[ 5] 0.00-10.00 sec 10.7 GBytes 9.22 Gbits/sec 1 sender
[ 5] 0.00-10.00 sec 10.7 GBytes 9.22 Gbits/sec receiver
pim@frpar0:~$ iperf3 -c chgtg0.ipng.ch -R
...
[ 5] 0.00-10.01 sec 11.2 GBytes 9.42 Gbits/sec 1 sender
[ 5] 0.00-10.00 sec 11.2 GBytes 9.42 Gbits/sec receiver
```
I'm tired, but ultimately satisfied to have taken my private AS50869 across
_the FLAP_, with a physical presence, an IXP connection, bidirectional
TenGig ring capacity, and TenGig IP transit at each location. I think this
network is good to go for the next few years at least.
---
date: "2021-06-28T10:59:42Z"
title: Launch of AS112
---
I'm one of those people who is a fan of low-latency and high performance
distributed service architectures. After building out the IPng Network across
europe, I did notice a rather stark difference in presence of one particular
service: **AS112** anycast nameservers. In particular, only one internet
exchange that I'm on has a direct AS112 presence: FCIX in California.
Big-up to the kind folks in Fremont who operate [www.as112.net](https://www.as112.net).
## The Problem
Looking around Switzerland, no internet exchange actually has AS112 as a direct
member, and as such you'll find the service tucked away behind several ISPs, with
AS paths such as `13030 29670 112`, `6939 112` and `34019 112`. A traceroute
from a popular Swiss ISP, [Init7](https://init7.net/), will go to Germany, at a
roundtrip latency of 18.9ms. My own latency is 146ms as my queries are served
from FCIX:
```
pim@spongebob:~$ traceroute prisoner.iana.org
traceroute to prisoner.iana.org (192.175.48.1), 64 hops max, 40 byte packets
1 fiber7.xe8.chbtl0.ipng.ch (194.126.235.33) 2.658 ms 0.754 ms 0.523 ms
2 1790bre1.fiber7.init7.net (81.6.42.1) 1.132 ms 1.077 ms 3.621 ms
3 780eff1.fiber7.init7.net (109.202.193.44) 1.238 ms 1.162 ms 1.188 ms
4 r1win12.core.init7.net (77.109.181.155) 2.096 ms 2.1 ms 2.1 ms
5 r1zrh6.core.init7.net (82.197.168.222) 2.086 ms 3.904 ms 2.183 ms
6 r1glb1.core.init7.net (5.180.135.134) 2.043 ms 3.621 ms 2.088 ms
7 r2zrh2.core.init7.net (82.197.163.213) 2.353 ms 2.522 ms 2.289 ms
8 r2zrh2.core.init7.net (5.180.135.156) 2.08 ms 2.299 ms 2.202 ms
9 r1fra3.core.init7.net (5.180.135.173) 7.65 ms 7.582 ms 7.546 ms
10 r1fra2.core.init7.net (5.180.135.126) 7.928 ms 7.831 ms 7.997 ms
11 r1ber1.core.init7.net (77.109.129.8) 19.395 ms 19.287 ms 19.558 ms
12 octalus.in-berlin.a36.community-ix.de (185.1.74.3) 18.839 ms 18.717 ms 29.615 ms
13 prisoner.iana.org (192.175.48.1) 18.536 ms 18.613 ms 18.766 ms
pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.247 ms 0.158 ms 0.107 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.514 ms 0.474 ms 0.419 ms
3 usfmt0.ipng.ch (194.1.163.23) 146.451 ms 146.406 ms 146.364 ms
4 blackhole-1.iana.org (192.175.48.6) 146.323 ms 146.281 ms 146.239 ms
```
This path goes to FCIX because it's the only place where AS50869 picks up
AS112 directly, at an internet exchange, and therefore the localpref will
make this route preferred. But that's a long way to go for my DNS queries!
I think we can do better.
## Introduction
Taken from [RFC7534](https://datatracker.ietf.org/doc/html/rfc7534):
> Many sites connected to the Internet make use of IPv4 addresses that
> are not globally unique. Examples are the addresses designated in
> RFC 1918 for private use within individual sites.
>
> Devices in such environments may occasionally originate Domain Name
> System (DNS) queries (so-called "reverse lookups") corresponding to
> those private-use addresses. Since the addresses concerned have only
> local significance, it is good practice for site administrators to
> ensure that such queries are answered locally. However, it is not
> uncommon for such queries to follow the normal delegation path in the
> public DNS instead of being answered within the site.
>
> It is not possible for public DNS servers to give useful answers to
> such queries. In addition, due to the wide deployment of private-use
> addresses and the continuing growth of the Internet, the volume of
> such queries is large and growing. The AS112 project aims to provide
> a distributed sink for such queries in order to reduce the load on
> the corresponding authoritative servers. The AS112 project is named
> after the Autonomous System Number (ASN) that was assigned to it.
## Deployment
It's actually quite straightforward; the deployment consists of roughly
three steps:
1. Procure hardware to run the instances of the nameserver on.
1. Configure the nameserver to serve the zonefiles.
1. Announce the anycast service locally/regionally.
Let's discuss each in turn.
### Hardware
For the hardware, I've decided to use existing server platforms at IP-Max
and IPng Networks. There are two types of hardware, both tried and tested:
one is an HP ProLiant DL380 Gen9, the other an older Dell PowerEdge R610.
Since each vendor ships different parts, many appliance builders choose to
virtualize their environment so that the guest operating system sees a very
homogeneous configuration. For my purposes, the virtualization platform is
Xen and the guest is a (para)virtualized Debian.
I will be starting with three nodes, one in Geneva and one in Zurich, hosted
on hypervisors of [IP-Max](https://www.ip-max.net/), and one in Amsterdam,
hosted on a hypervisor of [IPng](https://ipng.ch/). I have a feeling a few
more places will follow.
#### Install the OS
Xen makes this repeatable and straightforward. Other systems, such as KVM,
have very similar installers, for example VMBuilder is popular. Both work
roughly the same way, and install a guest in a matter of minutes.
I'll install to an LVM volume group on all machines, backed by pairs of SSDs
for throughput and redundancy. We'll give the guest 4GB of memory and 4
CPUs. I love how the machine boots using [PyGrub](https://wiki.debian.org/PyGrub),
fully on serial, and is fully booted and running in 20 seconds.
```
sudo xen-create-image --hostname as112-1.free-ix.net --ip 46.20.249.197 \
--vcpus 4 --pygrub --dist buster --lvm=vg1_hvn04_gva20
sudo xl create -c as112-1.free-ix.net.cfg
```
After logging in, the following additional software was installed. We'll be
using [Bird2](https://bird.network.cz/), which comes on Debian Buster's
backports. Otherwise, we're pretty vanilla:
```
$ cat << EOF | sudo tee -a /etc/apt/sources.list
#
# Backports
#
deb http://deb.debian.org/debian buster-backports main
EOF
$ sudo apt update
$ sudo apt install tcpdump sudo net-tools bridge-utils nsd bird2 \
netplan.io traceroute ufw curl bind9-dnsutils
$ sudo apt purge ifupdown
```
I removed the `/etc/network/interfaces` approach and configured Netplan,
a personal choice, which aligns the machines more closely with other servers
in the IPng fleet. The only trick is to ensure that the anycast IP addresses
are available for the nameserver to listen on, so at the top of Netplan's
configuration file, we add them like so:
```
network:
version: 2
renderer: networkd
ethernets:
lo:
addresses:
- 127.0.0.1/8
- ::1/128
- 192.175.48.1/32 # prisoner.iana.org (anycast)
- 2620:4f:8000::1/128 # prisoner.iana.org (anycast)
- 192.175.48.6/32 # blackhole-1.iana.org (anycast)
- 2620:4f:8000::6/128 # blackhole-1.iana.org (anycast)
- 192.175.48.42/32 # blackhole-2.iana.org (anycast)
- 2620:4f:8000::42/128 # blackhole-2.iana.org (anycast)
- 192.31.196.1/32 # blackhole.as112.arpa (anycast)
- 2001:4:112::1/128 # blackhole.as112.arpa (anycast)
```
### Nameserver
My nameserver of choice is [NSD](https://www.nlnetlabs.nl/projects/nsd/about/),
and its configuration is similar to BIND, which is described in RFC7534. In
fact, the zone files are identical, so all we should do is create a few listen
statements and load up the zones:
```
$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/listen.conf
server:
ip-address: 127.0.0.1
ip-address: ::1
ip-address: 46.20.249.197
ip-address: 2a02:2528:a04:202::197
ip-address: 192.175.48.1 # prisoner.iana.org (anycast)
ip-address: 2620:4f:8000::1 # prisoner.iana.org (anycast)
ip-address: 192.175.48.6 # blackhole-1.iana.org (anycast)
ip-address: 2620:4f:8000::6 # blackhole-1.iana.org (anycast)
ip-address: 192.175.48.42 # blackhole-2.iana.org (anycast)
ip-address: 2620:4f:8000::42 # blackhole-2.iana.org (anycast)
ip-address: 192.31.196.1 # blackhole.as112.arpa (anycast)
ip-address: 2001:4:112::1 # blackhole.as112.arpa (anycast)
server-count: 4
EOF
$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/as112.conf
zone:
name: "hostname.as112.net"
zonefile: "/etc/nsd/master/db.hostname.as112.net"
zone:
name: "hostname.as112.arpa"
zonefile: "/etc/nsd/master/db.hostname.as112.arpa"
zone:
name: "10.in-addr.arpa"
zonefile: "/etc/nsd/master/db.dd-empty"
# etcetera
EOF
```
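The `# etcetera` above stands for the remaining delegated zones. Rather than
typing them all out, a short script can emit the stanzas -- the zone list
below is transcribed from RFC 7534, and the shared empty zone file matches
the config above:

```python
# Generate NSD zone stanzas for the AS112 reverse zones (RFC 7534).
# All of them share the same "empty" zone file, db.dd-empty.
ZONES = (
    ["10.in-addr.arpa"]
    + [f"{i}.172.in-addr.arpa" for i in range(16, 32)]   # 172.16/12
    + ["168.192.in-addr.arpa", "254.169.in-addr.arpa"]   # 192.168/16, 169.254/16
)

def nsd_stanzas(zones) -> str:
    return "\n".join(
        f'zone:\n  name: "{z}"\n  zonefile: "/etc/nsd/master/db.dd-empty"\n'
        for z in zones
    )

print(nsd_stanzas(ZONES))
```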
While all of the zones are captured by `db.dd-empty` or `db.dr-empty`, which
can be found in the RFC text, I'll note the top two are special, as they are
specific to the instance. For example on our Geneva instance:
```
$ cat << EOF | sudo tee /etc/nsd/master/db.hostname.as112.arpa
$TTL 1W
@ SOA chplo01.paphosting.net. noc.ipng.ch. (
1 ; serial number
1W ; refresh
1M ; retry
1W ; expire
1W ) ; negative caching TTL
NS blackhole.as112.arpa.
TXT "AS112 hosted by IPng Networks" "Geneva, Switzerland"
TXT "See https://www.as112.net/ for more information."
TXT "See https://free-ix.net/ for local information."
TXT "Unique IP: 194.1.163.147"
TXT "Unique IP: [2001:678:d78:7::147]"
LOC 46 9 55.501 N 6 6 25.870 E 407.00m 10m 100m 10m
```
This is super helpful to users, who want to know which server, exactly,
is serving their request. Not all operators added the `Unique IP` details,
but I found it useful when launching the service, as several anycast nodes
quickly become confusing otherwise :-)
After this is all done, the nameserver can be started. I rebooted the guest
for good measure, and about 19 seconds later (a fact that continues to
amaze me), the server was up and serving queries, albeit only from localhost
because there is no way to reach the server on the network, yet.
To validate things work, we can do a few SOA or TXT queries, like this one:
```
pim@nlams01:~$ ping -c5 -q prisoner.iana.org
PING prisoner.iana.org(prisoner.iana.org (2620:4f:8000::1)) 56 data bytes
--- prisoner.iana.org ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 34ms
rtt min/avg/max/mdev = 0.041/0.045/0.053/0.004 ms
pim@nlams01:~$ dig @prisoner.iana.org hostname.as112.net TXT +short +norec
"AS112 hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
```
### Network
Now comes the fun part! We're running these instances of the nameservers
in a few locations, and to ensure we don't route traffic to the incorrect
location, we'll announce them using BGP as per recommendation of RFC7534.
My choice of routing suite is [Bird2](https://bird.network.cz/), which comes
with a lot of extensibility and programmatic validation of routing policies.
We'll only be using the `static` and `BGP` protocols in Bird, so the
configuration is relatively straightforward. First, we create a routing table
export for IPv4 and IPv6. Then we define some static _Nullroutes_, which ensure
that our prefixes are always present in the RIB (otherwise BGP will not export
them). Next, we create some filter functions (one for routeserver sessions,
one for peering sessions, and one for transit sessions), and finally we include
a few specific configuration files, one per environment where we'll be active.
```
$ cat << EOF | sudo tee /etc/bird/bird.conf
router id 46.20.249.197;
protocol kernel fib4 {
ipv4 { export all; };
scan time 60;
}
protocol kernel fib6 {
ipv6 { export all; };
scan time 60;
}
protocol static static_as112_ipv4 {
ipv4;
route 192.175.48.0/24 blackhole;
route 192.31.196.0/24 blackhole;
}
protocol static static_as112_ipv6 {
ipv6;
route 2620:4f:8000::/48 blackhole;
route 2001:4:112::/48 blackhole;
}
include "bgp-freeix.conf";
include "bgp-ipng.conf";
include "bgp-ipmax.conf";
EOF
```
The configuration file per environment, say `bgp-freeix.conf`, can (and will)
be autogenerated, but the pattern is of the following form:
```
$ cat << EOF | tee /etc/bird/bgp-freeix.conf
#
# Bird AS112 configuration for FreeIX
#
define my_ipv4 = 185.1.205.252;
define my_ipv6 = 2001:7f8:111:42::70:1;
protocol bgp freeix_as51530_1_ipv4 {
description "FreeIX - AS51530 - Routeserver #1";
local as 112;
source address my_ipv4;
neighbor 185.1.205.254 as 51530;
ipv4 {
import where fn_import_routeserver( 51530 );
export where proto = "static_as112_ipv4";
import limit 120000 action restart;
};
}
protocol bgp freeix_as51530_1_ipv6 {
description "FreeIX - AS51530 - Routeserver #1";
local as 112;
source address my_ipv6;
neighbor 2001:7f8:111:42::c94a:1 as 51530;
ipv6 {
import where fn_import_routeserver( 51530 );
export where proto = "static_as112_ipv6";
import limit 120000 action restart;
};
}
# etcetera
EOF
```
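This per-environment file can indeed be autogenerated from a neighbor list.
A minimal sketch of such a generator -- the function name and data layout
are my own illustration, not IXPManager's actual templating:

```python
# Emit Bird2 BGP protocol stanzas for one IXP, given its routeservers.
# Session names and the fn_import_routeserver() filter follow the
# hand-written example above.
def bird_ixp_config(ixp: str, rs_as: int,
                    neighbors: list[tuple[int, str, str]]) -> str:
    out = [f"# Bird AS112 configuration for {ixp}"]
    for idx, v4, v6 in neighbors:
        for af, addr in (("ipv4", v4), ("ipv6", v6)):
            out.append(f"""
protocol bgp {ixp.lower()}_as{rs_as}_{idx}_{af} {{
  description "{ixp} - AS{rs_as} - Routeserver #{idx}";
  local as 112;
  source address my_{af};
  neighbor {addr} as {rs_as};
  {af} {{
    import where fn_import_routeserver( {rs_as} );
    export where proto = "static_as112_{af}";
    import limit 120000 action restart;
  }};
}}""")
    return "\n".join(out)

print(bird_ixp_config("FreeIX", 51530,
                      [(1, "185.1.205.254", "2001:7f8:111:42::c94a:1")]))
```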
If you've seen IXPManager's approach to routeserver configuration generators,
you'll notice I borrowed the `fn_import()` function and its dependents from
there. This allows imports to be restricted to specific prefix-lists and
AS paths, and ensures some _Belts and Braces_ checks are in place (no invalid
or Tier-1 ASN in the path, a valid nexthop, no tricks with AS path truncation,
and so on).
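For flavour, a much-reduced sketch of what such an import function might
check, in Bird2 filter syntax -- the ASN list and length threshold are
illustrative, and the real IXPManager-derived filters do considerably more
(IRR prefix-lists, next-hop validation, bogon checks):

```
function fn_import_routeserver(int peeras)
{
  # Reject absurdly long AS paths.
  if ( bgp_path.len > 64 ) then return false;

  # A routeserver-learned path should never contain a Tier-1 carrier.
  if ( bgp_path ~ [= * [174, 701, 1299, 2914, 3257, 3320, 3356, 6453, 6762] * =] )
    then return false;

  return true;
}
```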
After bringing up the service, the prefixes make their way into the
routeserver and get distributed to the FreeIX participants:
```
$ sudo systemctl start bird
$ sudo birdc show protocol
BIRD 2.0.7 ready.
Name Proto Table State Since Info
fib4 Kernel master4 up 2021-06-28 11:01:35
fib6 Kernel master6 up 2021-06-28 11:01:35
device1 Device --- up 2021-06-28 11:01:35
static_as112_ipv4 Static master4 up 2021-06-28 11:01:35
static_as112_ipv6 Static master6 up 2021-06-28 11:01:35
freeix_as51530_1_ipv4 BGP --- up 2021-06-28 11:01:17 Established
freeix_as51530_1_ipv6 BGP --- up 2021-06-28 11:01:19 Established
freeix_as51530_2_ipv4 BGP --- up 2021-06-28 11:01:32 Established
freeix_as51530_2_ipv6 BGP --- up 2021-06-28 11:01:37 Established
```
#### Internet Exchanges
Having one configuration file per group helps a lot with integration of
[IXPManager](https://www.ixpmanager.org/) where we might autogenerate the IXP
versions of these files and install them periodically. That way, when members
enable the `AS112` peering checkmark, the servers will automatically download
and set up those sessions without human involvement -- typically this is the
best way to avoid outages: never tinker with production config files by hand.
We'll test this out with [FreeIX](https://free-ix.net/), but hope as well to
offer our service to other internet exchanges, notably SwissIX and CIXP.
One of the huge benefits of operating within [IP-Max](https://ip-max.net/)
network is their ability to do L2VPN transport from any place on-net to any
other router. As such, connecting these virtual machines to other places,
like SwissIX, CIXP, CHIX-CH, Community-IX or other further away places,
is a piece of cake. All we must do is create an L2VPN and offer it to the
hypervisor (which usually is connected via a LACP _BundleEthernet_) on some
VLAN, after which we can bridge that into the guest OS by creating a new
virtio NIC. This is how, in the example above, our AS112 machines were
introduced to FreeIX. This scales very well, requiring only one guest reboot
per internet exchange, and greatly simplifies operations.
### Monitoring
Of course, one would not want to run a production service, certainly not
on the public internet, without a bit of introspection and monitoring.
There are four things we might want to ensure:
1. Is the machine up and healthy? For this we use NAGIOS.
1. Is NSD serving? For this we use NSD Exporter and Prometheus/Grafana.
1. Is NSD reachable? For this we use CloudProber.
1. If there is an issue, can we alert an operator? For this we use Telegram.
In a followup post, I'll demonstrate how these things come together into
a comprehensive anycast monitoring and alerting solution. As a fringe benefit
we can show contemporary graphs and dashboards. But seeing as the service
hasn't yet gotten a lot of mileage, it deserves its own followup post, some
time in August.
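As a taste of what the reachability check (item 3 above) boils down to,
here is a hypothetical minimal prober in pure Python: it hand-builds a TXT
query for `hostname.as112.net` and fires it at one of the anycast addresses.
The server address and timeout are illustrative defaults; in production,
CloudProber does this with proper scheduling, metrics and alerting:

```python
import socket
import struct

def build_query(name: str, qtype: int = 16, qid: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (qtype 16 = TXT). RD is left
    unset, like dig +norec, since the AS112 servers are authoritative."""
    header = struct.pack(">HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)  # qclass IN

def probe(server: str = "192.175.48.1", timeout: float = 2.0) -> bool:
    """True if the anycast nameserver answered our TXT query in time."""
    query = build_query("hostname.as112.net")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(query, (server, 53))
        try:
            reply, _ = s.recvfrom(4096)
        except socket.timeout:
            return False
    # Same query id, and the QR (response) bit set in the flags?
    return reply[:2] == query[:2] and bool(reply[2] & 0x80)
```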
## The results
First things first - latency went waaaay down:
```
pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.257 ms 0.199 ms 0.159 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.468 ms 0.430 ms 0.430 ms
3 chrma0.ipng.ch (194.1.163.8) 0.648 ms 0.611 ms 0.597 ms
4 blackhole-1.iana.org (192.175.48.6) 1.272 ms 1.236 ms 1.201 ms
pim@chumbucket:~$ dig -6 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Free-IX hosted by IP-Max SA" "Zurich, Switzerland"
"See https://www.as112.net/ for more information."
"See https://free-ix.net/ for local information."
"Unique IP: 46.20.246.67"
"Unique IP: [2a02:2528:1703::67]"
```
This demonstrates why it's super useful to have the `hostname.as112.net`
entry populated well. If I'm in Amsterdam, I'll be served by the local node there:
```
pim@gripe:~$ traceroute6 blackhole-2.iana.org
traceroute6 to blackhole-2.iana.org (2620:4f:8000::42), 64 hops max, 60 byte packets
1 nlams0.ipng.ch (2a02:898:146::1) 0.744 ms 0.879 ms 0.818 ms
2 blackhole-2.iana.org (2620:4f:8000::42) 1.104 ms 1.064 ms 1.035 ms
pim@gripe:~$ dig -4 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
```
Of course, due to anycast, and me being in Zurich, I will be served primarily
by the Zurich node. If it were to go down for maintenance or a hardware failure,
BGP would immediately converge on alternate paths; there are currently three
to choose from:
```
pim@chrma0:~$ show protocols bgp ipv4 unicast 192.31.196.0/24
BGP routing table entry for 192.31.196.0/24
Paths: (10 available, best #2, table default)
Advertised to non peer-group peers:
185.1.205.251 194.1.163.1 [...]
112
194.1.163.32 (metric 137) from 194.1.163.32 (194.1.163.32)
Origin IGP, localpref 400, valid, internal
Community: 50869:3500 50869:4099 50869:5055
Last update: Mon Jun 28 11:13:14 2021
112
185.1.205.251 from 185.1.205.251 (46.20.246.67)
Origin IGP, localpref 400, valid, external, bestpath-from-AS 112, best (Local Pref)
Community: 50869:3500 50869:4099 50869:5000 50869:5020 50869:5060
Last update: Mon Jun 28 11:00:45 2021
112
185.1.205.251 from 185.1.205.253 (185.1.205.253)
Origin IGP, localpref 200, valid, external
Community: 50869:1061
Last update: Mon Jun 28 11:00:20 2021
(and more)
```
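The failover behavior above boils down to BGP best-path selection. As a toy illustration (hypothetical and heavily simplified; real BGP has many more tie-breakers), picking by highest local preference, then shortest AS path, then preferring eBGP over iBGP:

```python
def best_path(paths):
    """Pick the BGP best path: highest local-pref wins, then shortest AS path,
    then eBGP is preferred over iBGP. A heavily simplified sketch."""
    return max(paths, key=lambda p: (p["localpref"],
                                     -len(p["as_path"]),
                                     p["type"] == "external"))

# Toy model of the three paths shown above (attributes copied from the output):
paths = [
    {"nexthop": "194.1.163.32",  "as_path": [112], "localpref": 400, "type": "internal"},
    {"nexthop": "185.1.205.251", "as_path": [112], "localpref": 400, "type": "external"},
    {"nexthop": "185.1.205.251", "as_path": [112], "localpref": 200, "type": "external"},
]
# The direct external path at local-pref 400 wins; if it disappears, the
# internal path takes over, and the local-pref 200 path is the last resort.
```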
I am expecting a few more direct paths to come as I harden this service and
offer it to other Swiss internet exchange points in the future. But mostly, my
mission of reducing the round trip time from 146ms to 1ms from my desktop at
home was successfully accomplished.

---
date: "2021-07-03T22:16:44Z"
title: IPng arrives in Geneva
---
I've been planning a network expansion for a while now. For the next few weeks,
I will be in total geek-mode as I travel to several European cities to deploy
AS50869 on a european ring. At the same time, my buddy Fred from
[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
it just never clicked with the right set of circumstances for them to deploy
in the Netherlands, until the stars aligned ...
## Deployment
{{< image width="400px" float="left" src="/assets/network/qdr.png" alt="Quai du Rh&ocirc;ne" >}}
After our adventure in [Frankfurt]({{< ref "2021-05-17-frankfurt" >}}),
[Amsterdam]({{< ref "2021-05-26-amsterdam" >}}), [Lille]({{< ref "2021-05-28-lille" >}}),
and [Paris]({{< ref "2021-06-01-paris" >}}) came to an end, I still had a few
loose ends to tie up. In particular, in Lille I had dropped an old Dell R610
while waiting for new Supermicros to be delivered. There is benefit to having
one standard footprint setup, in my case a PCEngines `APU2`, Supermicro
`5018D-FN8T` and Intel `X710-DA4` expansion NIC. They run fantastic with
[DANOS](https://danosproject.org/) and [VPP](https://fd.io/) applications.
Of course, we mustn't forget _home base_, Geneva, where IP-Max has its
headquarters in a beautiful mansion pictured here. At the same time, my
family likes to take one trip per month to a city we don't usually go to, sort
of to keep up with real life as we are now more and more able to travel.
Marina has a niece in Geneva, who has lived and worked there for 20+ years,
so we figured we'd combine these things and stay the weekend at her place.
After making our way from Zurich to Geneva, a trip that took us just short
of six hours (!) by car, we arrived at the second half of the Belgium vs Italy
Eurocup soccer match. It was perhaps due to our tardiness and lack of physical
support that the Belgians lost the match that day. Sorry!
### Connectivity
{{< image width="400px" float="right" src="/assets/network/chplo0-rack.png" alt="Geneva Rack" >}}
My current circuit runs from Paris (Leon Frot), `frpar0.ipng.ch` over a direct
DWDM wave to Zurich where I pick it up on `chgtg0.ipng.ch` at Interxion
Glattbrugg. So what we'll do is break open this VLL at the IP-Max side,
insert the new router `chplo0.ipng.ch`, and reconfigure the Paris side
to go to the new router, and the new router to create another VLL back
to Zurich, which due to the topology of IP-Max's underlying DWDM network
will traverse Paris - Lyon - Geneva instead (shaving off ~1.5ms of latency
at the same time).
I hung up the `APU2` OOB server and the `5018D-FN8T` router, and another Dell
R610 to run virtual machines at Safehost SH1 in Plan-les-Ouates, a southern
suburb of Geneva. I connected one 10G port to `er01.gva20.ip-max.net` and
another 10G port to `er02.gva20.ip-max.net` to obtain maximum availability
benefits. As an example of what the configuration on the ASR9k platform looks
like for this type of operation, here's what I committed on `er01.gva20`.
Of course, first things first: let's ensure that the OOB machine has
connectivity, by allocating a /64 IPv6 and /29 IPv4. I usually configure
myself a BGP transit session in the same subnet, which means we'll want to
bridge the 1G UTP connection of the APU with the 10G fiber connection of
the Supermicro router, like so:
```
interface BVI911
description Cust: IPng OOB and Transit
ipv4 address 46.20.250.105 255.255.255.248
ipv4 unreachables disable
ipv6 nd suppress-ra
ipv6 address 2a02:2528:ff05::1/64
ipv6 enable
load-interval 30
!
interface GigabitEthernet0/7/0/38
description Cust: IPng APU (OOB)
mtu 9064
load-interval 30
l2transport
!
!
interface TenGigE0/1/0/3
description Cust: IPng (VLL and Transit)
mtu 9014
!
interface TenGigE0/1/0/3.911 l2transport
encapsulation dot1q 911 exact
rewrite ingress tag pop 1 symmetric
mtu 9018
!
l2vpn
bridge group BG_IPng
bridge-domain BD_IPng911
interface Te0/1/0/3.911
!
interface GigabitEthernet0/7/0/38
!
routed interface BVI911
!
!
!
```
After this, we pulled UTP cable and configured the `APU2`, which then has an
internal network towards the IPMI port of the Supermicro, and from there on,
the configuration becomes much easier. Of course, all config can be done
wirelessly: the APU `console.plo.ipng.nl` acts as a WiFi access point, so I
connect to it and commit the network configs.
Once that's online and happy, the router `chplo0.ipng.ch` is next. For this,
on `er02.par02.ip-max.net`, I reconfigure the current VLL to point to the
loopback of this router `er01.gva20.ip-max.net` using the same `pw-id`. Then,
I can configure this router as follows:
```
interface TenGigE0/1/0/3.100 l2transport
description Cust: IPng VLL to par02
encapsulation dot1q 100
rewrite ingress tag pop 1 symmetric
mtu 9018
!
l2vpn
pw-class EOMPLS-PW-CLASS
encapsulation mpls
transport-mode ethernet
!
!
xconnect group IPng
p2p IPng_to_par02
interface TenGigE0/1/0/3.100
neighbor ipv4 46.20.255.33 pw-id 210535705
pw-class EOMPLS-PW-CLASS
!
!
!
```
## The results
And with that, the pseudowire is constructed, and the original interface on
`frpar0.ipng.ch` directly sees the interface here on `chplo0.ipng.ch` using
jumboframes of 9000 bytes (+14 bytes of ethernet overhead and +4 bytes of VLAN
tag on the ingress interface). It is as if the routers are directly connected
by a very long ethernet cable, a _pseudo-wire_ if you wish. Super low pingtimes
are observed between this new router in Geneva and the existing two in Paris
and Zurich:
```
pim@chplo0:~$ /bin/ping -4 -c5 frpar0
PING frpar0.ipng.ch (194.1.163.33) 56(84) bytes of data.
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=1 ttl=64 time=8.78 ms
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=2 ttl=64 time=8.80 ms
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=3 ttl=64 time=8.81 ms
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=4 ttl=64 time=8.82 ms
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=5 ttl=64 time=8.85 ms
--- frpar0.ipng.ch ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 10ms
rtt min/avg/max/mdev = 8.783/8.810/8.846/0.104 ms
pim@chplo0:~$ /bin/ping -6 -c5 chgtg0
PING chgtg0(chgtg0.ipng.ch (2001:678:d78::1)) 56 data bytes
64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=1 ttl=64 time=4.51 ms
64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=2 ttl=64 time=4.44 ms
64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=3 ttl=64 time=4.36 ms
64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=4 ttl=64 time=4.47 ms
64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=5 ttl=64 time=4.41 ms
--- chgtg0 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 10ms
rtt min/avg/max/mdev = 4.362/4.436/4.506/0.077 ms
```
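The MTU values in the configs above follow directly from that arithmetic. As a quick sanity check (a sketch, using the standard Ethernet header sizes):

```python
ETH_HEADER = 14  # dst MAC (6) + src MAC (6) + ethertype (2)
VLAN_TAG = 4     # one 802.1q tag

ip_mtu = 9000  # jumboframes carried across the pseudowire

# The untagged physical interface carries IP payload plus the L2 header:
l2_mtu = ip_mtu + ETH_HEADER              # 9014, as on TenGigE0/1/0/3
# The dot1q subinterface sees one extra VLAN tag on ingress:
l2_mtu_tagged = ip_mtu + ETH_HEADER + VLAN_TAG  # 9018, as on the .100/.911 subinterfaces
```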
For good measure I've also connected to FreeIX, a new internet exchange project
I'm working on, that will span the Geneva, Zurich and Lugano areas. More on that
in a future post!
```
pim@chplo0:~$ iperf3 -4 -c 185.1.205.1 ## chgtg0.ipng.ch
Connecting to host 185.1.205.1, port 5201
[ 5] local 185.1.205.2 port 46872 connected to 185.1.205.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 809 MBytes 6.78 Gbits/sec 4 11.4 MBytes
[ 5] 1.00-2.00 sec 869 MBytes 7.29 Gbits/sec 0 11.4 MBytes
[ 5] 2.00-3.00 sec 865 MBytes 7.25 Gbits/sec 0 11.4 MBytes
[ 5] 3.00-4.00 sec 868 MBytes 7.28 Gbits/sec 0 11.4 MBytes
[ 5] 4.00-5.00 sec 836 MBytes 7.01 Gbits/sec 0 11.4 MBytes
[ 5] 5.00-6.00 sec 852 MBytes 7.15 Gbits/sec 0 11.4 MBytes
[ 5] 6.00-7.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes
[ 5] 7.00-8.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes
[ 5] 8.00-9.00 sec 861 MBytes 7.22 Gbits/sec 0 11.4 MBytes
[ 5] 9.00-10.00 sec 860 MBytes 7.22 Gbits/sec 0 11.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 8.35 GBytes 7.17 Gbits/sec 4 sender
[ 5] 0.00-10.01 sec 8.35 GBytes 7.16 Gbits/sec receiver
iperf Done.
```
You kind of get used to performance stats like this, but that said, it's nice
to see that performance over FreeIX is slightly *lower* than over the IPng
backbone: on my VLLs I can make use of jumbo frames, which gives me 20% or so
better throughput (currently 9.62 Gbits/sec).
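Why do jumbo frames help so much on a pps-bound software router? A back-of-the-envelope sketch (wire overhead figures are standard Ethernet; the 20% above is an empirical observation, not derived here):

```python
def pps_at(rate_bps, payload_mtu):
    """Packets/second needed to fill a link, counting the L2 header (14B),
    preamble+SFD (8B) and inter-frame gap (12B) each frame occupies on the wire."""
    wire_bytes = payload_mtu + 14 + 8 + 12
    return rate_bps / (wire_bytes * 8)

# Filling 10Gbit/s with 1500 byte vs 9000 byte packets:
pps_1500 = pps_at(10e9, 1500)  # ~815 kpps
pps_9000 = pps_at(10e9, 9000)  # ~138 kpps: roughly 6x fewer lookups per second
```

Fewer packets per second means fewer FIB lookups and fewer per-packet CPU cycles, which is where a software dataplane spends its time.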
Currently I'm busy at work in the background completing the configuration, the
management environment and physical infrastructure for the internet exchange.
I'm planning to make a more complete post about the FreeIX project in a few
weeks once it's ready for launch. Stay tuned!

---
date: "2021-07-19T16:12:54Z"
title: 'Review: PCEngines APU6 (with SFP)'
---
* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Pascal Dornier <[pdornier@pcengines.ch](mailto:pdornier@pcengines.ch)>
* Status: Draft - Review - **Approved**
I did this test back in February, but can now finally publish the results! This little SBC is
definitely going to be a hit in the ISP industry. See more information about it [here](https://www.pcengines.ch/apu6b4.htm).
[PC Engines](https://pcengines.ch/) develops and sells small single board computers for networking
to a worldwide customer base. This article discusses a new/unreleased product which PC Engines has
developed, which has specific significance in the network operator community: an SBC which comes
with three RJ45/UTP based network ports, and one SFP optical port.
# Executive Summary
Due to the use of Intel i210-IS on the SFP port and i211-AT on the three copper ports, and due to
it having no moving parts (fans, hard disks, etc), this SBC is an excellent choice for network
appliances such as out-of-band or serial consoles in a datacenter, or routers in a small business
or home office.
## Detailed findings
{{< image width="300px" float="right" src="/assets/pcengines-apu6/apu6.png" alt="APU6" >}}
The [APU](https://www.pcengines.ch/apu2.htm) series boards typically ship with 2GB or 4GB of DRAM,
2, 3 or 4 Intel i211-AT network interfaces, and a four core AMD GX-412TC (running at 1GHz). This
review is about the following **APU6** unit, which comes with 4GB of DRAM (this preproduction unit
came with 2GB, but that will be fixed in the production version), 3x i211-AT for the RJ45
network interfaces, and one i210-IS with an SFP cage.
One other significant difference is visible -- the trusty rusty DB9 connector that exposes the first
serial RS232 port is replaced with a modern CP2104 (USB vendor `10c4:ea60`) from Silicon Labs which
exposes the serial port as TTL/serial on a micro USB connector rather than RS232, neat!
## Transceiver Compatibility
{{< image float="right" src="/assets/pcengines-apu6/optics.png" alt="Optics" >}}
The small form-factor pluggable (SFP) is a compact, hot-pluggable network interface module used for
both telecommunication and data communications applications. An SFP interface on networking hardware
is a modular slot for a media-specific transceiver in order to connect a fiber-optic cable or
sometimes a copper cable. Such a slot is typically called a _cage_.
The SFP port accepts most/any optics brand and configuration (Copper, regular 850nm/1310nm/1550nm
based, BiDi as commonly used in FTTH deployments, CWDM for use behind an OADM). I tried 6 different
SFP modules, all successfully: each one provided link and passed traffic, regardless of vendor or
brand. See the links in the table below for the output of an optical diagnostics tool (using the
SFF-8472 standard for SFP/SFP+ management).

The loadtest below was done with the BiDi optics in one interface and a boring RJ45 copper cable
in another. It's going to be fantastic to be able to use these APU6's in a datacenter setting as
remote / out-of-band serial devices, specifically nowadays where UTP is becoming a scarcity and
everybody has fiber infrastructure in their racks.
Vendor | Type | Description | Details
------- | ----------------- | ------------------------------- | ------------------
Finisar | FTLF8519P2BNL-RB | 850nm duplex | [sfp0.txt](/assets/pcengines-apu6/sfp0.txt)
Generic | Unknown(no DOM) | 850nm duplex | [sfp1.txt](/assets/pcengines-apu6/sfp1.txt)
Cisco | GLC-LH-SMD | 1310nm duplex | [sfp2.txt](/assets/pcengines-apu6/sfp2.txt)
Cisco | SFP-GE-BX-D | 1490nm Bidirectional (FTTH CPE) | [sfp3.txt](/assets/pcengines-apu6/sfp3.txt)
Cisco | SFP-GE-BX-U | 1310nm Bidirectional (FTTH COR) | [sfp3.txt](/assets/pcengines-apu6/sfp3.txt)
Cisco | BT-OC24-20A | 1550nm OC24 SDH | [sfp4.txt](/assets/pcengines-apu6/sfp4.txt)
Finisar | FTRJ1319P1BTL-C7 | 1310nm 20km (w/ 6dB attenuator) | [sfp5.txt](/assets/pcengines-apu6/sfp5.txt)
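For a flavor of what such a diagnostics tool does under the hood, here is a hedged sketch of decoding a few values from the SFF-8472 diagnostics page (A2h). The offsets and units follow the spec; the page contents here are synthetic example bytes, not from a real module:

```python
def decode_dom(a2: bytes):
    """Decode temperature, supply voltage and RX power from an SFF-8472
    A2h diagnostics page (offsets 96-105 per the spec)."""
    temp_c = int.from_bytes(a2[96:98], "big", signed=True) / 256  # 1/256 degC units
    vcc_v = int.from_bytes(a2[98:100], "big") / 10_000            # 100 uV units
    rx_mw = int.from_bytes(a2[104:106], "big") / 10_000           # 0.1 uW units
    return temp_c, vcc_v, rx_mw

# Synthetic page encoding 26.5 degC, 3.30 V and 0.5 mW receive power:
page = bytearray(256)
page[96:98] = (26 * 256 + 128).to_bytes(2, "big")
page[98:100] = (33_000).to_bytes(2, "big")
page[104:106] = (5_000).to_bytes(2, "big")
temp, vcc, rx = decode_dom(bytes(page))
```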
## Network Loadtest
The choice of Intel i210/i211 network controller on this board allows operators to use Intel's
DPDK with relatively high performance, compared to regular (kernel) based routing. I loadtested
Linux (Ubuntu 20.04), OpenBSD (6.8), and two lesser known but way cooler DPDK open source
appliances called Danos ([ref](https://www.danosproject.org/)) and VPP ([ref](https://fd.io/))
respectively.
Specifically worth calling out that while Linux and OpenBSD struggled, both DPDK appliances had
absolutely no problems filling a bidirectional gigabit stream of "regular internet traffic"
(referred to as `imix`), and came close to _line rate_ with "64b UDP packets". The line rate of
a gigabit ethernet is 1.48Mpps in one direction, and my loadtests stressed both directions
simultaneously.
### Methodology
For the loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
with a custom Python controller that ramps up and down traffic from the loadtester to the device
under test (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be
presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). The loadtester first sends a few seconds of warmup, this is to ensure the DUT is
passing traffic and offers the ability to inspect the traffic before the actual rampup. Then
the loadtester ramps up linearly from zero to 100% of line rate (in our case, line rate is
one gigabit in both directions), finally it holds the traffic at full line rate for a certain
duration. If at any time the loadtester fails to see the traffic it's emitting return on its
second port, it flags the DUT as saturated; and this is noted as the maximum bits/second and/or
packets/second.
```
usage: trex-loadtest.bin [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE] [-wm WARMUP_MULT]
[-wd WARMUP_DURATION] [-rt RAMPUP_TARGET]
[-rd RAMPUP_DURATION] [-hd HOLD_DURATION]
T-Rex Stateless Loadtester -- pim@ipng.nl
optional arguments:
-h, --help show this help message and exit
-s SERVER, --server SERVER
Remote trex address (default: 127.0.0.1)
-p PROFILE_FILE, --profile PROFILE_FILE
STL profile file to replay (default: imix.py)
-o OUTPUT_FILE, --output OUTPUT_FILE
File to write results into, use "-" for stdout (default: -)
-wm WARMUP_MULT, --warmup_mult WARMUP_MULT
During warmup, send this "mult" (default: 1kpps)
-wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION
Duration of warmup, in seconds (default: 30)
-rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET
Target percentage of line rate to ramp up to (default: 100)
-rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION
Time to take to ramp up to target percentage of line rate, in seconds (default: 600)
-hd HOLD_DURATION, --hold_duration HOLD_DURATION
Time to hold the loadtest at target percentage, in seconds (default: 30)
```
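The ramp-up and saturation-detection logic described above can be sketched as follows (a simplified model: the real controller drives the T-Rex STL API, and the DUT is real hardware rather than a function):

```python
def find_saturation(offered_max_pps, dut_capacity_pps, steps=600):
    """Linearly ramp offered load from 0 to offered_max_pps; return the first
    offered rate (pps) at which the DUT no longer returns all traffic, or
    None if it keeps up all the way to full line rate."""
    for step in range(steps + 1):
        offered = offered_max_pps * step / steps
        returned = min(offered, dut_capacity_pps)  # toy stand-in for the DUT
        if returned < offered:
            return offered  # DUT saturated: note this as max pps
    return None
```

For example, a toy DUT that forwards at most 1 Mpps is flagged as saturated just above 1 Mpps, while one with 2 Mpps of capacity survives a full gigabit 64b ramp.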
It's worth pointing out that almost all systems are _pps-bound_, not _bps-bound_. A typical rant
of mine is that network vendors are imprecise when they specify their throughput: "up to 40Gbit"
more often than not means "under carefully crafted conditions", such as utilizing jumboframes
(9216 bytes rather than the "usual" 1500 byte MTU found on ethernet, which is easier on the router
than a typical internet mixture of closer to 1100 bytes, and much easier yet than forwarding
64 byte packets, for instance in a DDoS attack); only in one direction; and only using exactly
one source/destination IP address/port, which is quite a bit easier than looking up a destination
in a forwarding table containing 1M destinations -- for context, a current internet backbone
router carries ~845K IPv4 destinations and ~105K IPv6 destinations.
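That 1.48Mpps figure falls straight out of Ethernet's wire overhead. As a worked example (using the standard per-frame preamble/SFD and inter-frame gap):

```python
def line_rate_pps(frame_bytes, link_bps):
    """Max frames/second on an Ethernet link: each frame occupies the frame
    itself plus 8 bytes of preamble/SFD and a 12 byte inter-frame gap."""
    return link_bps / ((frame_bytes + 8 + 12) * 8)

# Minimum-size (64 byte) frames on gigabit: ~1.488 Mpps per direction.
gige_64b = line_rate_pps(64, 1_000_000_000)
```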
### Results
Product | Loadtest | Throughput (pps) | Throughput (bps) | % of linerate | Details
------- | -------- | ---------------- | ------------------| ------------- | ----------
Linux | imix | 150.21 Kpps | 452.81 Mbps | 45.28% | [apu6-linux-imix.json](/assets/pcengines-apu6/apu6-linux-imix.json)
OpenBSD | imix | 145.52 Kpps | 444.51 Mbps | 44.45% | [apu6-openbsd-imix.json](/assets/pcengines-apu6/apu6-openbsd-imix.json)
**VPP** | **imix** | **654.40 Kpps** | **2.00 Gbps** | **199.90%** | [apu6-vpp-imix.json](/assets/pcengines-apu6/apu6-vpp-imix.json)
**Danos** | **imix** | **655.53 Kpps** | **2.00 Gbps** | **200.24%** | [apu6-danos-imix.json](/assets/pcengines-apu6/apu6-danos-imix.json)
Linux | 64b | 96.93 Kpps | 65.14 Mbps | 6.51% | [apu6-linux-64b.json](/assets/pcengines-apu6/apu6-linux-64b.json)
OpenBSD | 64b | 152.09 Kpps | 102.20 Mbps | 10.22% | [apu6-openbsd-64b.json](/assets/pcengines-apu6/apu6-openbsd-64b.json)
VPP | 64b | 1.78 Mpps | 1.19 Gbps | 119.49% | [apu6-vpp-64b.json](/assets/pcengines-apu6/apu6-vpp-64b.json)
**Danos** | **64b** | **2.30 Mpps** | **1.55 Gbps** | **154.62%** | [apu6-danos-64b.json](/assets/pcengines-apu6/apu6-danos-64b.json)
{{< image width="800px" src="/assets/pcengines-apu6/results.png" alt="Results" >}}
For more information on the methodology and the scripts that drew these graphs, take a look
at my buddy Michal's [GitHub Page](https://github.com/wejn/trex-loadtest-viz), which, given
time, will probably turn into its own subsection of this website (I can only imagine the value
of a corpus of loadtests of popular equipment in the consumer arena).
## Caveats
The unit was shipped to me free of charge by PC Engines for the purposes of load- and systems
integration testing. Other than that, this is not a paid endorsement and views of this review
are my own.
## Open Questions
### SFP I2C
Considering the target audience, I wonder if there is a possibility to break out the I2C pins from
the SFP cage into a header on the board, so that users can connect them through to the CPU's I2C
controller (or bitbang directly on GPIO pins), and use the APU6 as an SFP flasher. I think that
would come in incredibly handy in a datacenter setting.
### CPU bound
The DPDK based router implementations are CPU bound, and could benefit from a little bit more power.
I am duly impressed by the throughput seen in terms of packets/sec/watt, but considering a typical
router has a (forwarding) dataplane as well as a (configuration) controlplane, we are short about
30% of CPU cycles. If a controlplane (like Bird or FRR ([ref](https://frrouting.org))) is given a
dedicated core, that leaves us three cores for forwarding, with which we obtain roughly 154% of
linerate; we'd need about `200/154 == 1.3x` that to obtain line rate in both directions. That said,
the APU6 has absolutely no problems saturating a gigabit **in both directions** under normal (==imix)
circumstances.
# Appendix
## Appendix 1 - Terminology
**Term** | **Description**
-------- | ---------------
OADM | **optical add drop multiplexer** -- a device used in wavelength-division multiplexing systems for multiplexing and routing different channels of light into or out of a single mode fiber (SMF)
ONT | **optical network terminal** - The ONT converts fiber-optic light signals to copper based electric signals, usually Ethernet.
OTO | **optical telecommunication outlet** - The OTO is a fiber optic outlet that allows easy termination of cables in an office and home environment. Installed OTOs are referred to by their OTO-ID.
CARP | **common address redundancy protocol** - Its purpose is to allow multiple hosts on the same network segment to share an IP address. CARP is a secure, free alternative to the Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP).
SIT | **simple internet transition** - Its purpose is to interconnect isolated IPv6 networks, located in global IPv4 Internet via tunnels.
STB | **set top box** - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts.
GRE | **generic routing encapsulation** - a tunneling protocol developed by Cisco Systems that can encapsulate a wide variety of network layer protocols inside virtual point-to-point links over an Internet Protocol network.
L2VPN | **layer2 virtual private network** - a service that emulates a switched Ethernet (V)LAN across a pseudo-wire (typically an IP tunnel)
DHCP | **dynamic host configuration protocol** - an IPv4 network protocol that enables a server to automatically assign an IP address to a computer from a defined range of numbers.
DHCP6-PD | **Dynamic host configuration protocol: prefix delegation** - an IPv6 network protocol that enables a server to automatically assign network prefixes to a customer from a defined range of numbers.
NDP NS/NA | **neighbor discovery protocol: neighbor solicitation / advertisement** - an ipv6 specific protocol to discover and judge reachability of other nodes on a shared link.
NDP RS/RA | **neighbor discovery protocol: router solicitation / advertisement** - an ipv6 specific protocol to discover and install local address and gateway information.
SBC | **single board computer** - a complete computer with all peripherals and components directly attached to the board.

---
date: "2021-07-26T11:16:44Z"
title: A story of a Bucketlist
---
## Introduction
Many people maintain what is called a Bucketlist, a list of things they
wish to do before they _kick the bucket_. I have one also, and although
most of the items on that list are earthly and more on the emotional
realm, and private, there is one specific thing that I have wanted to
do ever since I first started working in IT in 1998: Peer at the
Amsterdam Internet Exchange.
This post details striking this particular item off my bucketlist. It's
both indulgent, humblebraggy and incredibly nerdy and it talks a bit about
mental health. If those are trigger words for you, skip ahead to another
post, like my series on VPP ;-)
## 1998 - Netherlands
{{< image width="300px" float="right" src="/assets/bucketlist/bucketlist-ede.png" alt="The Kelvinstraat" >}}
I started working when I was still at the TU/Eindhoven, and after a great
sysadmin job at Radar Internet, which became Track and was sold to Wegener
Arcade, I turned towards networking. After building Freeler (the first
_free_ ISP in the Netherlands) with Adrianus and co, and a small stint at
their primary uplink Intouch with Rager (rest in peace, Brother), I joined
BIT (AS12859) from 2000 to 2006, and it was here where I developed a true
passion for that which makes the internet 'tick': routing protocols.
I was secretly jealous that BIT could afford Junipers, F5 loadbalancers and
large Cisco switches, and I loved working with and on those machines. BIT
had a reseller relationship with BBNed, and were able to directly connect
ADSL modems into their own infrastructure, and as such I could afford to get
myself a subnet from 213.154.224.0/19 routed to my house in Wageningen. It
was where I had a half-19" rack in a clothing closet in our guest bedroom,
and it was there that I decided: I want to eventually participate in the
BGP world and peer at AMS-IX (the only exchange at the time, NLIX was just
starting up, thanks again, Jan!).
Pictured to the right was my first contribution to AS12859 - deploying a
CWDM ring from Ede to Amsterdam and upgrading our backbone from an ATM E3
(34Mbit) and POS STM1 (155Mbit) leased line to Gigabit Ethernet on Juniper
M5 routers, this was in 2001, 20 years ago almost to the month.
## 2008 - Switzerland
{{< image width="300px" float="right" src="/assets/bucketlist/bucketlist-dk2.png" alt="The Cavern" >}}
Fast forward to 2006, I moved to Switzerland and while I remained friendly
with NLNOG and SWINOG (and a few other network operator groups), I did not
pursue the whole internet exchange thing. I had operated networks for the
greater part of a decade, and with my full time job, I spent a lot of time
learning how to be a good _Site Reliability Engineer_. I still had three /24
PI space blocks, used for different purposes in the past, but I was much
more comfortable letting the "real" ISPs announce them - in my case AS25091
[IP-Max](https://ip-max.net/) (thanks, Fred!) and AS13030 [Init7](https://init7.net/)
(thanks, Fredy!) and AS12859 [BIT](https://bit.nl/) (thanks, Michel!). I
cannot remember any meaningful downtime in any of those operators, of course
there is always some, but due to the N+2 nature of my network deployment, I
don't think any global downtime for my internet presence has ever occurred.
It's not a coincidence that even Google for the longest time used my website
at [SixXS](https://sixxs.net/) for their own monitoring, now _that_ is
cool. Although Jeroen and I did decide to retire the SixXS project (see my
[Sunset]({{< ref "2017-03-01-sixxs-sunset" >}}) article on why), the website
is still up and served off of three distinct networks, because I have to stay
true to the SRE life.
Pictured to the right was one of the two racks at Deltalis DK2, a datacenter
built into a mountain in the heart of the swiss Alps. Classic edge/core/border
approach with (at the time) state of the art Cisco 7600 routers. One of these
is destined to become my nightstand at some point, this was in 2013, which
is now (almost) 10 years ago.
### Corona Motivation
My buddy Fred from IP-Max would regularly ask me "why don't you just announce
your /24 yourself?" It'd be fun, he said. In 2007, we registered a /24 PI for
SixXS, and I was always quite content to let _him_ handle the routing. But it
started to itch and a neighbor of mine inadvertently reminded me of this itch
(thanks, Max) by asking me if I was interested to share an L2 ethernet link
with him from our place in Br&uuml;ttisellen to one of the datacenters in
Z&uuml;rich, a distance of about 7km as the photons fly.
{{< image width="110px" float="left" src="/assets/bucketlist/bucketlist-corona.png" alt="The Virus" >}}
I could not resist any longer. I was working long(er) than average hours due
to the work-from-home situation: you easily chop off 45-60min of commute each
day, but I noticed myself spending it in more meetings instead of in the train.
I was slowly getting into a bad state, and my motivation was very low. I wanted
to do something other than sleep-eat-work-sleep and even my jogging went to an
all time minimum. I had very low emotional energy.
To put my mind off of things, I decided to reattach to my networking roots in
a few ways: one was to build an AS and operate it for a while (maybe a few years
until I get bored of it, and then re-parent my IP space to some friendly ISP,
or who knows, cash in rich and sell my IP space to the highest bidder!), and
the other was to pursue my longtime desire for a competent replacement for silicon,
now that CPUs-of-now are just as fast as ASICs-of-then, and to contribute to DANOS
and VPP.
#### Step 1. Build a basement ISP
The first part meant getting a PC with Bird, or in my case an appliance called [DANOS](https://danosproject.org/)
which uses [DPDK](https://dpdk.org/) to implement wirespeed routing on commodity
x86/64 hardware. I happily announced my /24 and /48 from NTT's datacenter,
connected to the local internet exchange [Swissix](https://swissix.ch/) and
rented an L2 circuit to my house via [Openfactory](https://openfactory.net/). Also,
I showed that a simple Supermicro (for example [SYS-5018D-FN8T](https://www.supermicro.com/products/system/1u/5018/SYS-5018D-FN8T.cfm))
could easily handle line rate 64 byte frames in both directions on its TenGigabit
interfaces (that's 29Mpps), and still have a responsive IPMI serial console. It
reminded me of the early days of Juniper martini class routers, where Jean would
say ".. and the chassis doesn't even get warm". That's certainly correct today,
cuz that Supermicro draws 35W, which is one microwatt per packet routed!
#### Step 2. Build a European Ring
{{< image width="350px" float="right" src="/assets/bucketlist/bucketlist-staging-ams.png" alt="Staging Amsterdam" >}}
Of course, I cannot end there, as I have a bucketlist item to work towards. I always
wanted to peer in Amsterdam, ever since 2001 when I joined BIT. So I worked out a
plan with Fred, who has also been wanting to go to Amsterdam with his Swiss ISP
[IP-Max](https://ip-max.net/).
So, in a really epic roadtrip full of nerd, Fred and I went into total geek-mode
as we traveled to several European cities to deploy AS50869 on a european ring. I
wrote about my experience extensively in these blog posts:
* [Frankfurt]({{< ref "2021-05-17-frankfurt" >}}): May 17th 2021.
* [Amsterdam]({{< ref "2021-05-26-amsterdam" >}}): May 26th 2021.
* [Lille]({{< ref "2021-05-28-lille" >}}): May 28th 2021.
* [Paris]({{< ref "2021-06-01-paris" >}}): June 1st 2021.
* [Geneva]({{< ref "2021-07-03-geneva" >}}): July 3rd 2021.
I think we can now say that I'm _peering on the FLAP_. It's not that this AS50869
carries that much traffic, but it's a very welcome relief of daily worklife to be
able to do something _fun_ and _immediately rewarding_ like turn up a BGP session
and see the traffic go from Zurich to any one of these cities at 10Gbit in any
direction. No congestion, no _packetlo_, just pure horsepower performance.
#### Step 3. Build Linux CP in VPP
Next month, I plan to take [VPP](https://fd.io/) out for an elaborate spin. I've been
running DANOS on my routers for a while now, and I'm pretty happy with it, but there
are a few quirks that are annoying me more and more. Notably, the conversion of Vyatta
style commands in the configuration into an FRR config is often lossy. There's a few
key features (such as RPKI or LDP signalling for MPLS paths) that I'm missing, and
the dataplane, although pretty stable, has crashed maybe three or four times over the
last year. Note: One of IP-Max's many Cisco ASR9k also had a few line card reboots in
the last year so maybe these crashes are par for the course.
Ever since Netgate and Cisco started work on the Linux Control Plane plugin, which
takes interfaces in the VPP dataplane and exposes those as TAP interfaces in Linux, I've
wanted to contribute to it. I've been determined to make use of VPP+LinuxCP in my own
network. However, development on the plugin has completely stalled; the one that ships with
VPP 21.06 is rudimentary at best: it doesn't do QinQ/QinAD, doesn't apply changes from the
dataplane to the Linux network interface, and the plugin that mirrors netlink messages has
been stuck in limbo for a few months. So I reached out to the authors in May and offered to
complete / rewrite the plugins. I find that writing code, compiling and testing it, and
being able to immediately see the improvements in a live network incredibly motivating
and energizing.
Expect to see a few posts in August/September about this work!
## 2021 - Switzerland
{{< image width="400px" float="right" src="/assets/bucketlist/bucketlist-mentalhealth.png" alt="Alpine Health" >}}
I can say that making a few small tweaks and adjustments, and breaking the WFH
regime into "work" from home and "play" from home, helps a lot. I now have a HDMI
switch that flips my desk from my work Mac into my personal OpenBSD machine, and a
19" rack in my basement with equipment to loadtest and develop VPP, and I often do
some small chores like establish a peering session and happily traceroute from my
basement to Amsterdam.
I've spent some time in the mountains, in a family commitment to go to a new Swiss
canton every month. The picture on the right was taken from First in Grindelwald,
looking south towards Eiger and M&ouml;nch. I live in an absolutely beautiful country.
Thanks, Switzerland ;-)
On the Bucketlist front, I have the following to report. I waited a few months before
writing this post, but I can confidently say that accomplishing this L2/L3 path from
my workstation in Br&uuml;ttisellen, where I'm typing this blogpost, all the way over
Frankfurt to Amsterdam, and being able to reach my original colocation machine at AS8283
[Coloclue](https://coloclue.net/) using only switches, routers and IP addresses I own,
is a continual joy. Seeing that my work now affords me a straight gigabit of bandwidth
in each direction fills me with engineering pride and happiness.
```
pim@chumbucket:~$ traceroute ghoul.ipng.nl
traceroute to ghoul.ipng.nl (94.142.244.54), 30 hops max, 60 byte packets
1 chbtl0.ipng.ch (194.1.163.66) 0.236 ms 0.178 ms 0.143 ms
2 chrma0.ipng.ch (194.1.163.17) 1.394 ms 1.363 ms 1.332 ms
3 defra0.ipng.ch (194.1.163.25) 7.275 ms 7.362 ms 7.213 ms
4 nlams0.ipng.ch (194.1.163.27) 12.905 ms 12.843 ms 12.844 ms
5 ghoul.ipng.nl (94.142.244.54) 13.120 ms 13.181 ms 13.044 ms
```
And as far as the _actual_ bucketlist item goes, although I made it a bit harder on myself
by moving to Switzerland, IP-Max also made it easier by giving me a great price
on the backhaul connectivity to Amsterdam, so I can report that the bucket list item
is indeed checked off the list:
```
pim@nlams0:~$ show protocols bgp address-family ipv6 unicast summary
IPv6 Unicast Summary:
BGP table version 689670802
RIB entries 251402, using 46 MiB of memory
Peers 67, using 1427 KiB of memory
Peer groups 32, using 2048 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
2a02:1668:a2b:5:869::1 4 51088 1561576 216485 0 0 0 08w4d03h 126136 5
2a02:1668:a2b:5:869::2 4 51088 1546990 216485 0 0 0 08w4d03h 126127 5
2a02:898::d1 4 8283 812846953 127814 0 0 0 08w6d20h 130590 6
2a02:898::d2 4 8283 828908332 127814 0 0 0 08w0d16h 130590 6
2a02:898:146::2 4 112 101560 228562328 0 0 0 06w2d15h 2 132437
2a07:cd40:1::4 4 212855 105513 238069267 0 0 0 2d14h12m 1 132437
2602:fed2:fff:ffff::1 4 137933 4180058 124978 0 0 0 04w4d10h 551 7
2602:fed2:fff:ffff::253 4 209762 2034724 125048 0 0 0 1d00h14m 618 7
2001:7f8:10f::205b:140 4 8283 137242 121460 0 0 0 08w5d17h 34 7
2001:7f8:10f::207b:145 4 8315 278651 274793 0 0 0 06w0d12h 34 7
2001:7f8:10f::500f:139 4 20495 117590 107877 0 0 0 04w3d00h 208 7
2001:7f8:10f::ac47:131 4 44103 152949 55010 0 0 0 05w1d13h 24 7
2001:7f8:10f::af36:129 4 44854 134969 146240 0 0 0 09w2d16h 1 7
2001:7f8:10f::afd1:133 4 45009 35438 35477 0 0 0 01w0d02h 3 7
2001:7f8:10f::e20a:148 4 57866 302505 280603 0 0 0 05w5d18h 161 7
2001:7f8:10f::e3bb:137 4 58299 1419455 104321 0 0 0 04w0d13h 531 7
2001:7f8:10f::ec8d:132 4 60557 120509 108071 0 0 0 01w4d20h 7 7
2001:7f8:10f::3:259e:143 4 206238 278960 272776 0 0 0 04w4d18h 2 7
2001:7f8:10f::3:3e9b:134 4 212635 823944 140075 0 0 0 08w5d17h 1 7
2001:7f8:10f::dc49:253 4 56393 5693179 157171 0 0 0 02w6d22h 26680 7
2001:7f8:10f::dc49:254 4 56393 5698910 162197 0 0 0 08w5d17h 26680 7
2a02:2528:1902::1 4 25091 9964126 137696 0 0 0 09w1d22h 113020 5
2001:7f8:8f::a500:6939:1 4 6939 8496149 138188 0 0 0 01w2d20h 48079 7
2001:7f8:8f::a500:8283:1 4 8283 23251 52823 0 0 0 03w3d02h Active 0
2001:7f8:8f::a501:3335:1 4 13335 3279 3199 0 0 0 1d02h35m 102 7
2001:7f8:8f::a502:495:1 4 20495 117248 107466 0 0 0 04w3d00h 208 7
2001:7f8:8f::a503:2934:1 4 32934 194428 193990 0 0 0 01w3d08h 30 7
2001:7f8:8f::a503:2934:2 4 32934 194035 194002 0 0 0 03w3d11h 30 7
2001:7f8:8f::a504:4854:1 4 44854 0 9052 0 0 0 never Idle (Admin) 0
2001:7f8:8f::a504:5009:1 4 45009 35433 35467 0 0 0 01w0d02h 3 7
2001:7f8:8f::a505:7866:1 4 57866 302602 276459 0 0 0 04w4d01h 161 7
2001:7f8:8f::a505:8299:1 4 58299 912125 141718 0 0 0 04w0d13h 531 7
2001:7f8:8f::a506:557:1 4 60557 120482 108067 0 0 0 01w4d20h 7 7
2001:7f8:8f::a521:2635:1 4 212635 622475 85332 0 0 0 02w5d10h 1 7
2001:7f8:8f::a504:9917:1 4 49917 8370930 158851 0 0 0 03w4d13h 25257 7
2001:7f8:8f::a504:9917:2 4 49917 8397150 160118 0 0 0 04w4d01h 25011 7
2001:7f8:13::a500:714:1 4 714 67722 66645 0 0 0 03w2d03h 146 7
2001:7f8:13::a500:714:2 4 714 68208 66645 0 0 0 03w2d03h 146 7
2001:7f8:13::a500:6939:1 4 6939 10980475 98099 0 0 0 07w0d10h 48079 7
2001:7f8:13::a502:495:1 4 20495 117773 107873 0 0 0 04w0d14h 208 7
2001:7f8:13::a503:4307:1 4 34307 10709086 100814 0 0 0 09w4d23h 23339 7
2001:7f8:13::a503:4307:2 4 34307 10694266 100814 0 0 0 09w4d23h 22137 7
2001:7f8:8f::a504:4103:1 4 44103 152932 55010 0 0 0 05w1d13h 24 7
2001:7f8:b7::a500:8283:1 4 8283 126035 98846 0 0 0 06w4d22h 34 7
2001:7f8:b7::a501:3335:1 4 13335 4277 4157 0 0 0 1d10h34m 102 7
2001:7f8:b7::a502:495:1 4 20495 117588 107871 0 0 0 04w3d00h 208 7
2001:7f8:b7::a504:5009:1 4 45009 35441 35504 0 0 0 01w0d02h 3 7
2001:7f8:b7::a506:557:1 4 60557 120546 108067 0 0 0 01w4d20h 7 7
2001:7f8:b7::a521:2635:1 4 212635 716031 94458 0 0 0 08w5d17h 1 7
2001:7f8:b7::a504:1441:1 4 41441 12911969 107363 0 0 0 08w2d12h 50606 7
2001:7f8:b7::a504:1441:2 4 41441 12733337 107304 0 0 0 08w2d12h 50606 7
Total number of neighbors 67
pim@nlams0:~$ show protocols ospfv3 neighbor
Neighbor ID Pri DeadTime State/IfState Duration I/F[State]
194.1.163.7 1 00:00:32 Full/PointToPoint 62d21:41:24 dp0p6s0f3.100[PointToPoint]
194.1.163.34 1 00:00:39 Full/PointToPoint 27d22:28:30 dp0p6s0f3.200[PointToPoint]
```
There are three full IPv4 and IPv6 transit providers: AS51088 ([A2B Internet](https://a2b-internet.com/),
thanks Erik!), AS8283 ([Coloclue](https://coloclue.net/)) and AS25091 ([IP-Max](https://ip-max.net/),
thanks Fred!). Also, the router is connected directly to Speed-IX, LSIX, FrysIX and NL-IX. Along with
the many other internet exchanges I've connected to, this puts my humble AS50869 at #5 on the list of
[best connected](https://bgp.he.net/country/CH) ISPs in Switzerland!
I mean, just look at that stability: BGP sessions are often up for as long as the machine
has been there (remember, I deployed `nlams0.ipng.ch` only in May, so 9 weeks is all we can ask for!).
OSPF uptime (helpfully shown as a duration by OSPFv3 on FRR) is impeccable as well. The link with 27d
of uptime is there because I took that router out for maintenance 27 days ago, to upgrade it to a preliminary
version of DANOS + Bird2, as I prepare the move to VPP + Bird2 later this year.
#### A note on mental health
Mental health includes our emotional, psychological, and social well-being. It
affects how we think, feel, and act. It also helps determine how we handle stress,
relate to others, and make choices. Mental health is important at every stage of
life, from childhood and adolescence through adulthood.
If you've read this far, thanks! I can imagine that some find this story a mixture of
nerd and brag, and that's OK. I am writing these stories because ***I find happiness in writing***
about the small and large technical things that I perceive as important to my
feelings of accomplishment and therefore my wellbeing.
I do many non-nerd and non-technical things, but I try to make it a habit of keeping my personal
life off the internet (I'm not on social media and not often on digital messaging boards or chat
apps). I could tell you equally enthusiastically about those hikes I took in Grindelwald, or
those B&uuml;rli I baked, but that would have to be in person.
Well-being is a positive outcome that is meaningful for people and for many sectors
of society, because it tells us that people perceive that their lives are going
well. However, many indicators that measure living conditions fail to measure what
people think and feel about their lives, such as the quality of their relationships,
their positive emotions and resilience, the realization of their potential, or their
overall satisfaction with life.
I find satisfaction in my modest dabbles with IPng Networks, both the software and
the hardware and physical aspects of it. I encourage everybody to have a safe/fun place
where they spend some meaningful time doing things that _spark joy_. To your health!
---
date: "2021-08-07T06:17:54Z"
title: 'Review: FS S5860-20SQ Switch'
---
[FiberStore](https://fs.com/) is a staple provider of optics and network gear in Europe. Although
I've been buying optics like SFP+ and QSFP+ from them for years, I rarely looked at the switch
hardware they have on sale, until my buddy Arend suggested one of their switches as a good
alternative for an Internet Exchange Point, one with [Frysian roots](https://frys-ix.net/) no less!
# Executive Summary
{{< image width="400px" float="left" src="/assets/fs-switch/fs-switches.png" alt="Switches" >}}
The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SFP28 ports
and 2x 40G QSFP ports, which can also be reconfigured to be 4x10G each. The switch has a
Cisco-like CLI and great performance. I loadtested a pair of them in L2, QinQ, and L3 mode,
and they handled all the packets I sent to and through them, with all of the 10G, 25G and 40G ports
in use. Considering the redundant power supplies, relatively low power usage, and silicon based
switching of L2 and L3, I definitely appreciate the price/performance. The switch would be an even
better match if it allowed for MPLS based L2VPN services, but it doesn't support that.
## Detailed findings
### Hardware
{{< image width="400px" float="right" src="/assets/fs-switch/fs-switch-inside.png" alt="Inside" >}}
The switch is based on Broadcom's [BCM56170](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56170), codename *Hurricane* with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I
noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom
has taken the per port ingress capacity, while FS.com is taking the ingress/egress port
capacity and summing them up. Further, the sales pitch claims 565Mpps, which I found curious: if
we divide the available bandwidth of 380Gbps (the number from the Broadcom datasheet) by the smallest
possible frame of 84 bytes (672 bits), we get 565Mpps. Why FS.com decided to seemingly
arbitrarily double the switching capacity while reporting the nominal forwarding rate is
beyond me.
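That 565Mpps figure can be reproduced from the datasheet numbers; the 84 bytes count the 64-byte minimum frame plus the 20 bytes of preamble and inter-frame gap that every frame costs on the wire:

```python
# Reproducing the datasheet arithmetic: 380 Gbps of switching capacity
# divided by the smallest possible frame-on-the-wire.
capacity_bps = 380e9
min_frame_bits = (64 + 8 + 12) * 8   # 64B frame + 8B preamble + 12B inter-frame gap = 672 bits

pps = capacity_bps / min_frame_bits
print(f"{pps / 1e6:.0f} Mpps")  # prints 565 Mpps
```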
You can see more (hires) pictures in this [Photo Album](https://photos.app.goo.gl/6M69UgowGf6yWJmU7).
This Broadcom chip is an SOC (System-on-a-Chip) which comes with an Arm A9 and modest
amount of TCAM on board and packs into a 31x31mm *ball grid array* formfactor. The switch
chip is able to store 16k routes and ACLs - it did not become immediately obvious to me what
the partitioning is (between IPv4 entries, IPv6 entries, L2/L3/L4 ACL entries); one can only
assume that the total sum of TCAM based objects must not exceed 16K entries. This means that
as a campus switch, the L3 functionality will be great, including with routing protocols such
as OSPF and ISIS. However, BGP with any amount of routing table activity will not be a good
fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)
This Broadcom chip alone retails for &euro;798,- apiece at [Digikey](https://www.digikey.ie/products/en?keywords=BCM56170B0IFSBG),
with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing
foundry and supply chain crisis, I don't know. But at that price point, the retail price of
&euro;1150,- per switch is really attractive.
{{< image width="300px" float="right" src="/assets/fs-switch/noise.png" alt="Noise" >}}
The switch comes with two modular and field-replaceable power supplies (rated at 150W each,
delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable
fan trays installed with one fan each. Idle, without any optics installed and with all interfaces
down, the switch draws about 18W of power, which is nice. The fans spin up only when needed,
and by default the switch is quiet, but certainly not silent. I measured it after a tip from
Michael, certainly nothing scientific, but in a silent room with a noise floor of ~30 dBA,
the switch booted up and briefly burst the fans at 60dBA, after which it **stabilized at 54dBA**
or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed
directly towards the rear of the device, at 1 meter distance. Or something, IDK, I'm a network
engineer, Jim, not an audio specialist!
Besides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be
broken out into 4x 10G as well, bringing the Tengig port count to the datasheet specified 28),
the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy to do
firmware upgrades and to copy files such as SSH keys back and forth), an RJ45 1G management
port, which does not participate in the switch at all, and an RJ45 serial port that uses a
standard Cisco cable for access and presents itself as `9600,8n1` to a console server, although
flow control must be disabled on the serial port.
#### Transceiver Compatibility
FS did not attempt any vendor locking or crippleware with the ports and optics, yaay for that.
I successfully inserted Cisco optics, Arista optics, FS.com 'Generic' optics, and several DACs
for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them. The switch,
as one would expect, supports diagnostics, which looks like this:
```
fsw0#show interfaces TFGigabitEthernet0/24 transceiver
Transceiver Type : 25GBASE-LR-SFP28
Connector Type : LC
Wavelength(nm) : 1310
Transfer Distance :
SMF fiber
-- 10km
Digital Diagnostic Monitoring : YES
Vendor Serial Number : G2006362849
Current diagnostic parameters[AP:Average Power]:
Temp(Celsius) Voltage(V) Bias(mA) RX power(dBm) TX power(dBm)
33(OK) 3.29(OK) 38.31(OK) -0.10(OK)[AP] -0.07(OK)
Transceiver current alarm information:
None
```
.. with a helpful shorthand `show interfaces ... trans diag` that only shows the optical budget.
### Software
I bought a pair of switches, and they came delivered with a current firmware version. The devices
identify themselves as `FS Campus Switch (S5860-20SQ) By FS.COM Inc` with a hardware version of `1.00`
and a software version of `S5860_FSOS 12.4(1)B0101P1`. Firmware updates can be downloaded from the
FS.com website directly. I'm not certain if there's a viable ONIE firmware for this chip, although the
N8050 certainly can run ONIE, Cumulus and its own ICOS which is backed by Broadcom. Maybe
in the future I could take a better look at the open networking firmware aspects of this type of
hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to
a certain amount of routes (I think 4K or 16K in the FIB, and only 1GB of ram on the SOC), this is
not the right platform to pour energy into trying to get DANOS to run on.
Taking a look at the CLI, it's very Cisco IOS-esque; there's a few small differences, but the look
and feel is definitely familiar. Base configuration kind of looks like this:
```
fsw0#show running-config
hostname fsw0
!
sntp server oob 216.239.35.12
sntp enable
!
username pim privilege 15 secret 5 $1$<redacted>
!
ip name-server oob 8.8.8.8
!
service password-encryption
!
enable service ssh-server
no enable service telnet-server
!
interface Mgmt 0
ip address 192.168.1.10 255.255.255.0
gateway 192.168.1.1
!
snmp-server location Zurich, Switzerland
snmp-server contact noc@ipng.ch
snmp-server community 7 <redacted> ro
!
```
Configuration also follows the familiar `conf t` (configure terminal) flow that many of us grew up
with, and `show` commands allow for `include` and `exclude` modifiers, of course with all the
shortest-next abbreviations such as `sh int | i Forty` and the likes. VLANs are to be declared
up front, with one notable cool feature of `supervlans`, which are the equivalent of aggregating
VLANs together in the switch - a useful example might be an internet exchange platform which has
trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer,
but then all end up in the same peering lan VLAN 100.
A few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this
to work, the `oob` keyword has to be used. This is likely because the mgmt port is a network interface
that is attached to the SOC, not to the switch fabric itself, and thus its route is not added to
the routing table. I like this, because it avoids the mgmt network being picked up in OSPF and
accidentally routed to/from. But it does make for a slightly awkward config:
```
fsw1#show running-config | inc oob
sntp server oob 216.239.35.12
ip name-server oob 8.8.8.8
ip name-server oob 1.1.1.1
ip name-server oob 9.9.9.9
fsw1#copy ?
WORD Copy origin file from native
flash: Copy origin file from flash: file system
ftp: Copy origin file from ftp: file system
http: Copy origin file from http: file system
oob_ftp: Copy origin file from oob_ftp: file system
oob_http: Copy origin file from oob_http: file system
oob_tftp: Copy origin file from oob_tftp: file system
running-config Copy origin file from running config
startup-config Copy origin file from startup config
tftp: Copy origin file from tftp: file system
tmp: Copy origin file from tmp: file system
usb0: Copy origin file from usb0: file system
```
Note here the hack `oob_ftp:` and such; this would allow the switch to copy things from the
OOB (management) network by overriding the scheme. But that's OK, I guess, not beautiful,
but it gets the job done and these types of commands will rarely be used.
A few configuration examples, notably QinQ, in which I configure a port to take usual dot1q
traffic, say from a customer, and add it into our local VLAN 200. Therefore, untagged traffic
on that port will turn into our VLAN 200, and tagged traffic will turn into our dot1ad stack
of outer VLAN 200 and inner VLAN whatever the customer provided -- in our case allowing only
VLANs 1000-2000 and untagged traffic into VLAN 200:
```
fsw0#configure
fsw0(config)#vlan 200
fsw0(config-vlan)#name v-qinq-outer
fsw0(config-vlan)#exit
fsw0(config)#interface TenGigabitEthernet 0/3
fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200
fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
```
The industry remains conflicted about the outer ethernet frame's type -- originally a tag
protocol identifier (TPID) of 0x9100 was suggested, and that's what this switch uses by default. But
the later formal specification of Q-in-Q, called 802.1ad, specified that the outer TPID should be
0x88a8, distinct from the 0x8100 used for regular VLAN tags. This ugly reality can be reflected directly
in the switchport configuration by adding a `frame-tag tpid 0xXXXX` value to let the switch know
which TPID needs to be used for the outer tag.
If this type of historical thing interests you, I definitely recommend reading up on Wikipedia on
[802.1q](https://en.wikipedia.org/wiki/IEEE_802.1Q) and [802.1ad](https://en.wikipedia.org/wiki/IEEE_802.1ad)
as well.
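For illustration, here's how the two tags nest on the wire. `qinq_tags` is a hypothetical helper (not anything this switch produces directly), and the PCP/DEI bits are left at zero:

```python
import struct

# Sketch of how the two VLAN tags nest on the wire: an 802.1ad (Q-in-Q) frame
# carries an outer tag whose TPID is 0x88a8 per the standard (this switch
# defaults to the older 0x9100, configurable via `frame-tag tpid`), followed
# by an inner 802.1Q tag with TPID 0x8100.
def qinq_tags(outer_vlan: int, inner_vlan: int, outer_tpid: int = 0x88A8) -> bytes:
    outer = struct.pack("!HH", outer_tpid, outer_vlan)  # TPID + (PCP/DEI=0, VID)
    inner = struct.pack("!HH", 0x8100, inner_vlan)
    return outer + inner

# Outer VLAN 200, inner VLAN 1000, matching the config example above:
print(qinq_tags(200, 1000).hex())  # prints 88a800c8810003e8
```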
## Loadtests
For my loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
with a custom Python controller that ramps up and down traffic from the loadtester to the device
under test (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be
presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). You can read a bit more about my setup in my [Loadtesting at Coloclue]({{< ref "2021-02-27-coloclue-loadtest.md" >}})
post.
To stress test the switch, several pairs at 10G and 25G were used, and since the specs boast
line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found out,
once again, that Intel's X710 network cards aren't line rate, something I'll dive into in a bit
more detail another day, for now, take a look at the [T-Rex docs](https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_stateless_bench.asciidoc).
### L2
First let's test a straight forward configuration. I connect a DAC between a 40G port on each
switch, and connect a loadtester to port `TenGigabitEthernet 0/1` and `TenGigabitEthernet 0/2`
on either switch, and leave everything simply in the default VLAN. This means packets from
Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and
out on Te0/1 and Te0/2 there, to return to the loadtester. Configuration wise, rather boring:
```
fsw0#configure
fsw0(config)#vlan 1
fsw0(config-vlan)#name v-default
fsw0#show run int te0/1
interface TenGigabitEthernet 0/1
fsw0#show run int te0/2
interface TenGigabitEthernet 0/2
fsw0#show run int fo0/26
interface FortyGigabitEthernet 0/26
switchport mode trunk
switchport trunk allowed vlan only 1
fsw0#show vlan id 1
VLAN Name Status Ports
---------- -------------------------------- --------- -----------------------------------
1 v-default STATIC Te0/1, Te0/2, Te0/3, Te0/4
Te0/5, Te0/6, Te0/7, Te0/8
Te0/9, Te0/10, Te0/11, Te0/12
Te0/13, Te0/14, Te0/15, Te0/16
Te0/17, Te0/18, Te0/19, Te0/20
TF0/21, TF0/22, TF0/23, Fo0/25
Fo0/26
```
I set up T-Rex with unique MAC addresses for each of its ports. I find it useful
to codify a few bits of information into the MAC, such as the loadtester machine,
PCI bus and port, so that when I have many loadtesters running at the same time
and go looking for them in the switches' forwarding tables, it's easier to
find what I'm looking for. My T-Rex configuration for this loadtest:
```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version : 2
interfaces : ["42:00.0","42:00.1", "42:00.2", "42:00.3"]
port_limit : 4
port_info :
- dest_mac : [0x0,0x2,0x1,0x1,0x0,0x00] # port 0
src_mac : [0x0,0x2,0x1,0x2,0x0,0x00]
- dest_mac : [0x0,0x2,0x1,0x2,0x0,0x00] # port 1
src_mac : [0x0,0x2,0x1,0x1,0x0,0x00]
- dest_mac : [0x0,0x2,0x1,0x3,0x0,0x00] # port 2
src_mac : [0x0,0x2,0x1,0x4,0x0,0x00]
- dest_mac : [0x0,0x2,0x1,0x4,0x0,0x00] # port 3
src_mac : [0x0,0x2,0x1,0x3,0x0,0x00]
```
Here's where I notice something I've noticed before: the Intel X710 network cards cannot
actually fill 4x10G at line rate. They're fine at larger frames, but they max out at about
32Mpps throughput -- and we know that each 10G connection filled with small ethernet frames
in one direction will consume 14.88Mpps. The same is true for the XXV710 cards, the chip
used will really only source 30Mpps across all ports, which is sad but true.
So I have a choice to make: either I run small packets at a rate that's acceptable for the
NIC (~7.5Mpps per port thus 30Mpps across the X710-DA4), or I run `imix` at line rate
but with slightly less packets/sec. I chose the latter for these tests, and will be reporting
the usage based on `imix` profile, which saturates 10G at 3.28Mpps in one direction, or
13.12Mpps per network card.
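The 3.28Mpps figure follows from the average imix frame size. A sketch assuming the classic 7:4:1 imix mix (T-Rex's exact profile sizes may differ slightly):

```python
# Why imix saturates 10G at ~3.28Mpps: the classic imix mixes 7 small,
# 4 medium and 1 large frame. Each frame also costs 20 bytes of L1 overhead
# (8B preamble + 12B inter-frame gap) on the wire.
imix = [(64, 7), (594, 4), (1518, 1)]   # (frame bytes, weight)
avg_l2 = sum(size * n for size, n in imix) / sum(n for _, n in imix)
avg_l1_bits = (avg_l2 + 20) * 8

pps = 10e9 / avg_l1_bits
print(f"{pps / 1e6:.2f} Mpps per 10G port")  # prints 3.27
```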
Of course, I can run two of these at the same time, *pourquoi pas*, which looks like this:
```
fsw0#show mac
Vlan MAC Address Type Interface Live Time
---------- -------------------- -------- ------------------------------ -------------
1 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11
1 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 00:16:11
1 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11
1 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 00:16:10
1 0002.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51
1 0002.0102.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:15:51
1 0002.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51
1 0002.0104.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:15:50
fsw0#show int usage | exclude 0.00
Interface Bandwidth Average Usage Output Usage Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1 10000 Mbit 94.66% 94.66% 94.66%
TenGigabitEthernet 0/2 10000 Mbit 94.66% 94.66% 94.66%
TenGigabitEthernet 0/3 10000 Mbit 94.65% 94.66% 94.66%
TenGigabitEthernet 0/4 10000 Mbit 94.66% 94.66% 94.66%
FortyGigabitEthernet 0/26 40000 Mbit 94.66% 94.66% 94.66%
fsw0#show cpu core
[Slot 0 : S5860-20SQ]
Core 5Sec 1Min 5Min
0 16.40% 12.00% 12.80%
```
This is the first time that I noticed that the switch usage (94.66%) somewhat confusingly
lines up with the observed T-Rex statistics: what the switch reports, T-Rex considers L2
(ethernet) use, not L1 use. For an in-depth explanation of this, see below in the L3
section. But for now, let's just say that when T-Rex says it's sending 37.9Gbps of ethernet
traffic (which is 40.00Gbps of bits on the line), that corresponds to the 94.75% we see
the switch reporting.
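A quick check of that accounting, using the round numbers from the paragraph above (37.9Gbps of ethernet bits against 40.00Gbps on the line):

```python
# T-Rex counts L2 (Ethernet) bits; the switch's "usage" counter matches that,
# not the L1 rate, which adds 20 bytes (preamble + inter-frame gap) per frame.
tx_l2_gbps = 37.9    # ethernet traffic as reported by T-Rex
tx_l1_gbps = 40.0    # actual bits on the line

print(f"{100 * tx_l2_gbps / tx_l1_gbps:.2f}% of line rate")  # prints 94.75%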
So suffice to say, at 80Gbit actual throughput (40G from Te0/1-3 ingress and 40G to
Te0/1-3 egress), the switch performs at line rate, with no noticable lag or jitter.
The CLI is responsive and the fans aren't spinning harder than at idle, even after 60min
of packets. Good!
### QinQ
Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) each drop
into a Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration:
```
interface TenGigabitEthernet 0/1
switchport mode dot1q-tunnel
switchport dot1q-tunnel allowed vlan add untagged 20
switchport dot1q-tunnel native vlan 20
!
interface TenGigabitEthernet 0/3
switchport mode dot1q-tunnel
switchport dot1q-tunnel allowed vlan add untagged 21
switchport dot1q-tunnel native vlan 21
spanning-tree bpdufilter enable
!
interface FortyGigabitEthernet 0/26
switchport mode trunk
switchport trunk allowed vlan only 1,20-21
fsw0#show mac
Vlan MAC Address Type Interface Live Time
---------- -------------------- -------- ------------------------------ -------------
20 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02
20 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 01:15:01
20 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02
20 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 01:15:03
21 0002.0101.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:01:50
21 0002.0102.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:03
21 0002.0103.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:01:59
21 0002.0104.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:02
```
Two things happen that require a bit of explanation. First of all, despite both loadtesters
use the exact same configuration (in fact, I didn't even stop them from emitting packets
while reconfiguring the switch), I now have packetloss, the throughput per 10G port has
reduced from 94.67% to 93.63% and at the same time, I observe that the 40G ports raised
their usage from 94.66% to 94.81%.
```
fsw1#show int usage | e 0.00
Interface Bandwidth Average Usage Output Usage Input Usage
------------------------------------ ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1 10000 Mbit 94.20% 93.63% 94.67%
TenGigabitEthernet 0/2 10000 Mbit 94.21% 93.65% 94.67%
TenGigabitEthernet 0/3 10000 Mbit 91.05% 94.66% 94.66%
TenGigabitEthernet 0/4 10000 Mbit 90.80% 94.66% 94.66%
FortyGigabitEthernet 0/26 40000 Mbit 94.81% 94.81% 94.81%
```
The switches, however, are perfectly fine. The reason for this loss is that when I created
the `dot1q-tunnel`, the switch sticks another VLAN tag (4 bytes, or 32 bits) on each packet
before sending it out the 40G port between the switches, and at these packet rates, it adds
up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps) which, when the
switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps
on top of the 40G line rate, implying we're going to be losing roughly 1.045% of our packets.
And indeed, the difference between 94.67 (inbound) and 93.63 (outbound) is 1.04% which lines
up.
```
Global Statistics
connection : localhost, Port 4501 total_tx_L2 : 37.92 Gbps
version : STL @ v2.91 total_tx_L1 : 40.02 Gbps
cpu_util. : 43.52% @ 8 cores (4 per dual port) total_rx : 37.92 Gbps
rx_cpu_util. : 0.0% / 0 pps total_pps : 13.12 Mpps
async_util. : 0% / 198.64 bps drop_rate : 0 bps
total_cps. : 0 cps queue_full : 0 pkts
Port Statistics
port | 0 | 1 | 2 | 3
-----------+-------------------+-------------------+-------------------+------------------
owner | pim | pim | pim | pim
link | UP | UP | UP | UP
state | TRANSMITTING | TRANSMITTING | TRANSMITTING | TRANSMITTING
speed | 10 Gb/s | 10 Gb/s | 10 Gb/s | 10 Gb/s
CPU util. | 46.29% | 46.29% | 40.76% | 40.76%
-- | | | |
Tx bps L2 | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps
Tx bps L1 | 10 Gbps | 10 Gbps | 10 Gbps | 10 Gbps
Tx pps | 3.28 Mpps | 3.27 Mpps | 3.27 Mpps | 3.28 Mpps
Line Util. | 100.04 % | 100.04 % | 100.04 % | 100.04 %
--- | | | |
Rx bps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps
Rx pps | 3.24 Mpps | 3.24 Mpps | 3.23 Mpps | 3.24 Mpps
---- | | | |
opackets | 1891576526 | 1891577716 | 1891547042 | 1891548090
ipackets | 1891576643 | 1891577837 | 1891547158 | 1891548214
obytes | 684435443496 | 684435873418 | 684424773684 | 684425153614
ibytes | 684435484082 | 684435916902 | 684424817178 | 684425197948
tx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts
rx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts
tx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB
rx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB
----- | | | |
oerrors | 0 | 0 | 0 | 0
ierrors | 0 | 0 | 0 | 0
```
### L3
For this test, I reconfigured the 25G ports to become routed rather than switched, and I put them
under 80% load with T-Rex (where 80% here is of L1), thus the ports are emitting 20Gbps of traffic
at a rate of 13.12Mpps. I left two of the 10G ports just continuing their ethernet loadtest at
100%, which is also 20Gbps of traffic and 13.12Mpps. In total, I observed 79.95Gbps of traffic
between the two switches: an entirely saturated 40G port in both directions.
I then created a simple OSPF topology: both switches configured a `Loopback0` interface with
a /32 IPv4 and a /128 IPv6 address, and a transit network between them on a `VLAN100` interface.
OSPF and OSPFv3 both redistribute connected and static routes, to keep things simple.
Finally, I added an IP address on the `Tf0/24` interface, set a static IPv4 route for 16.0.0.0/8
and 48.0.0.0/8 towards that interface on each switch, and added VLAN 100 to the `Fo0/26` trunk.
It looks like this for switch `fsw0`:
```
interface Loopback 0
ip address 100.64.0.0 255.255.255.255
ipv6 address 2001:DB8::/128
ipv6 enable
interface VLAN 100
ip address 100.65.2.1 255.255.255.252
ipv6 enable
ip ospf network point-to-point
ipv6 ospf network point-to-point
ipv6 ospf 1 area 0
interface TFGigabitEthernet 0/24
no switchport
ip address 100.65.1.1 255.255.255.0
ipv6 address 2001:DB8:1:1::1/64
interface FortyGigabitEthernet 0/26
switchport mode trunk
switchport trunk allowed vlan only 1,20,21,100
router ospf 1
graceful-restart
redistribute connected subnets
redistribute static subnets
area 0
network 100.65.2.0 0.0.0.3 area 0
!
ipv6 router ospf 1
graceful-restart
redistribute connected
redistribute static
area 0
!
ip route 16.0.0.0 255.0.0.0 100.65.1.2
ipv6 route 2001:db8:100::/40 2001:db8:1:1::2
```
With this topology, an L3 routing domain emerges between `Tf0/24` on switch `fsw0` and `Tf0/24`
on switch `fsw1`. Taking a look at `fsw1`, I can see that both IPv4 and
IPv6 adjacencies have formed, and that the switches, n&eacute;e routers, have learned
routes from one another:
```
fsw1#show ip ospf neighbor
OSPF process 1, 1 Neighbors, 1 is Full:
Neighbor ID Pri State BFD State Dead Time Address Interface
100.65.2.1 1 Full/ - - 00:00:31 100.65.2.1 VLAN 100
fsw1#show ipv6 ospf neighbor
OSPFv3 Process (1), 1 Neighbors, 1 is Full:
Neighbor ID Pri State BFD State Dead Time Instance ID Interface
100.65.2.1 1 Full/ - - 00:00:31 0 VLAN 100
fsw1#show ip route
Codes: C - Connected, L - Local, S - Static
R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
IA - Inter area, EV - BGP EVPN, A - Arp to host
* - candidate default
Gateway of last resort is no set
O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100
S 48.0.0.0/8 [1/0] via 100.65.0.2
O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100
C 100.64.0.1/32 is local host.
C 100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24
C 100.65.0.1/32 is local host.
O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100
C 100.65.2.0/30 is directly connected, VLAN 100
C 100.65.2.2/32 is local host.
fsw1#show ipv6 route
IPv6 routing table name - Default - 12 entries
Codes: C - Connected, L - Local, S - Static
R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
IA - Inter area, EV - BGP EVPN, N - Nd to host
O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
LC 2001:DB8::1/128 via Loopback 0, local host
C 2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected
L 2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host
O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100
C FE80::/10 via ::1, Null0
C FE80::/64 via Loopback 0, directly connected
L FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host
C FE80::/64 via TFGigabitEthernet 0/24, directly connected
L FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host
C FE80::/64 via VLAN 100, directly connected
L FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host
```
Great success! I can see from the `fsw1` output above, its OSPF process has learned
routes for the IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::1/128 respectively),
the connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and the
static routes (16.0.0.0/8 and 2001:db8:100::/40).
So let's make use of this topology and change one of the two loadtesters to switch
to L3 mode instead:
```
pim@hippo:~$ cat /etc/trex_cfg.yaml
- version : 2
interfaces : ["0e:00.0", "0e:00.1" ]
port_bandwidth_gb: 25
port_limit : 2
port_info :
- ip : 100.65.0.2
default_gw : 100.65.0.1
- ip : 100.65.1.2
default_gw : 100.65.1.1
```
I left the loadtest running for 12hrs or so, and observed the results to be squeaky clean.
The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating
40.00Gbit of traffic at 25.98Mpps (remember, this was setting the load to 80% on the 25G
port, and 99% on the 10G ports). Looking at the switch, I was once again surprised by a
discrepancy, so I decided to fully explore this switch's curious utilization reporting.
```
fsw1#show interfaces usage | exclude 0.00
Interface Bandwidth Average Usage Output Usage Input Usage
------------------------------------ ----------- -------------- -------------- -----------
TenGigabitEthernet 0/1 10000 Mbit 93.80% 93.79% 93.81%
TenGigabitEthernet 0/2 10000 Mbit 93.80% 93.79% 93.81%
TFGigabitEthernet 0/24 25000 Mbit 75.80% 75.79% 75.81%
FortyGigabitEthernet 0/26 40000 Mbit 94.79% 94.79% 94.79%
fsw1#show int te0/1 | inc packets/sec
10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec
10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec
fsw1#show int tf0/24 | inc packets/sec
10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec
10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec
fsw1#show int fo0/26 | inc packets/sec
10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec
10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec
```
Looking at that number, 75.80% was not the 80% that I had asked for, and actually the usage of
the 10G ports (which I put at 99% load) and 40G port are also lower than I had anticipated.
What's going on there? It's quite simple after doing some math: ***the switch is reporting
L2 bits/sec, not L1 bits/sec!***
On the L3 loadtest and using the `imix` profile, T-Rex is sending 13.02Mpps of load, which,
according to its own observation, is 37.8Gbps of L2 and 40.00Gbps of L1 bandwidth. On the L2
loadtest, again using the `imix` profile, T-Rex is sending 4x 3.24Mpps as well, which it claims
is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put the loadtester here at 99% of
line rate, to ensure I would not end up losing packets due to congestion on the 40G
port).
So according to T-Rex, I am sending 75.4Gbps of traffic (37.8Gbps in the L3 test and 37.6Gbps
in the simultaneous L2 loadtest), yet I'm seeing 37.9Gbps on the switchport. Oh my!
Here's how all of these numbers relate to one another:
* First off, we are sending 99% linerate at 3.24Mpps into Te0/1 and Te0/2 on each switch.
* Then, we are sending 80% linerate at 6.55Mpps into Tf0/24 on each switch.
* The Te0/1 and Te0/2 are both in the default VLAN on either side.
* But, the Tf0/24 is sending its IP traffic through the VLAN 100 interconnect, which means all
of that traffic gets a dot1q VLAN tag added. That's 4 bytes for each packet.
* Sending 6.55Mpps * 32bits extra, equals 209600000 bits/sec (0.21Gbps)
* Loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the difference
we calculated above (0.21Gbps): the overhead created by adding the VLAN tag to the
25G stream carried in VLAN 100.
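The dot1q arithmetic above can be double-checked in a couple of lines of Python (a back-of-the-envelope sketch; the packet rate is the ~6.55Mpps observed on Tf0/24):

```python
# One 802.1Q tag adds 4 bytes (32 bits) to every packet on the trunk.
DOT1Q_TAG_BITS = 4 * 8

pps = 6.55e6  # packets/sec observed on the Tf0/24 stream
overhead_bps = pps * DOT1Q_TAG_BITS
print(f"{overhead_bps / 1e9:.2f} Gbps")  # → 0.21 Gbps
```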
Now we are ready to explain the difference between the switch reported port usage and the
loadtester reported port usage:
* The loadtester is sending an `imix` traffic mix, which consists of 64, 590 and 1514 byte
packets in a 28:16:4 ratio.
* We already know that to create a packet on the wire, we have to add a 7 byte preamble,
a 1 byte start frame delimiter, and end with a 12 byte interpacket gap, so each ethernet
frame is 20 bytes longer, making 84 bytes the smallest possible on-the-wire frame.
* We know we're sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage:
* Each packet needs 20 bytes or 160 bits of overhead, which is 518400000 bits/sec
* We are seeing 9381044793 bits/sec on a 10G port (**corresponding switch 93.80% usage**)
* Adding these two numbers up gives us 9899444793 bits/sec (**corresponding T-Rex 98.99% usage**)
* Conversely, the whole system is sending 37.9Gbps on the 40G port (**corresponding switch 37.9/40 == 94.79% usage**)
* We know this is 2x 10G streams at 99% utilization and 1x25G stream at 80% utilization
* This is 13.03Mpps, which generate 2084800000 bits/sec of overhead
* Adding these two numbers up gives us 40.00 Gbps of usage (which is the expected L1 line rate)
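The L1-versus-L2 bookkeeping above can be made concrete with a short Python sketch (the counter value is taken from the `Te0/1` output above; the 20-byte figure is the Ethernet preamble + SFD + interpacket gap):

```python
# Per-frame L1 overhead on Ethernet: 7B preamble + 1B SFD + 12B interpacket gap.
L1_OVERHEAD_BITS = 20 * 8

# Te0/1: the switch counts L2 bits/sec, T-Rex reports L1 line utilization.
pps = 3_240_000                            # ~3.24Mpps on one 10G port
l2_bps = 9_381_044_793                     # switch counter on Te0/1
l1_bps = l2_bps + pps * L1_OVERHEAD_BITS   # add the framing overhead back in
print(f"switch sees {l2_bps / 10e9:.2%} of 10G, T-Rex sees {l1_bps / 10e9:.2%}")

# imix average packet size: 64, 590 and 1514 byte packets in a 28:16:4 ratio.
avg_bytes = (28 * 64 + 16 * 590 + 4 * 1514) / (28 + 16 + 4)
print(f"imix average packet: {avg_bytes:.0f} bytes")
```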
I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way,
the switches are switching and routing all of this with 0.00% packet loss,
and the chassis doesn't even get warm :-)
```
Global Statistics
connection : localhost, Port 4501 total_tx_L2 : 38.02 Gbps
version : STL @ v2.91 total_tx_L1 : 40.02 Gbps
cpu_util. : 21.88% @ 4 cores (4 per dual port) total_rx : 38.02 Gbps
rx_cpu_util. : 0.0% / 0 pps total_pps : 13.13 Mpps
async_util. : 0% / 39.16 bps drop_rate : 0 bps
total_cps. : 0 cps queue_full : 0 pkts
Port Statistics
port | 0 | 1 | total
-----------+-------------------+-------------------+------------------
owner | pim | pim |
link | UP | UP |
state | TRANSMITTING | TRANSMITTING |
speed | 25 Gb/s | 25 Gb/s |
CPU util. | 21.88% | 21.88% |
-- | | |
Tx bps L2 | 19.01 Gbps | 19.01 Gbps | 38.02 Gbps
Tx bps L1 | 20.06 Gbps | 20.06 Gbps | 40.12 Gbps
Tx pps | 6.57 Mpps | 6.57 Mpps | 13.13 Mpps
Line Util. | 80.23 % | 80.23 % |
--- | | |
Rx bps | 19 Gbps | 19 Gbps | 38.01 Gbps
Rx pps | 6.56 Mpps | 6.56 Mpps | 13.13 Mpps
---- | | |
opackets | 292215661081 | 292215652102 | 584431313183
ipackets | 292152912155 | 292153677482 | 584306589637
obytes | 105733412810506 | 105733412001676 | 211466824812182
ibytes | 105710857873526 | 105711223651650 | 211422081525176
tx-pkts | 292.22 Gpkts | 292.22 Gpkts | 584.43 Gpkts
rx-pkts | 292.15 Gpkts | 292.15 Gpkts | 584.31 Gpkts
tx-bytes | 105.73 TB | 105.73 TB | 211.47 TB
rx-bytes | 105.71 TB | 105.71 TB | 211.42 TB
----- | | |
oerrors | 0 | 0 | 0
ierrors | 0 | 0 | 0
```
### Conclusions
It's just super cool to see a switch like this work as expected. I did not manage to
overload it at all, neither with an IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor with
an L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in hardware as
well as IPv4 route lookups. I will be putting these switches into production soon on the
IPng Networks links between Glattbrugg and R&uuml;mlang in Zurich, thereby upgrading our
backbone from 10G to 25G CWDM. It seems to me that using these switches as L3 devices
in a smaller OSPF routing domain (currently, we have ~300 prefixes in our OSPF at
AS50869) would definitely work well, as would pushing and popping QinQ trunks for our
customers (for example on Solnet or Init7 or Openfactory).
Approved. A+, will buy again.

---
date: "2021-08-12T11:17:54Z"
title: VPP Linux CP - Part1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
In this first post, let's take a look at the table stakes: making a copy of VPP's interfaces appear in
the Linux kernel.
## My test setup
I took two AMD64 machines, each with 32GB of memory and one Intel X710-DA4 network card (which offers
four SFP+ cages), and installed Ubuntu 20.04 on them. I connected each of the network ports back
to back with DAC cables. This gives me plenty of interfaces to play with. On the vanilla Ubuntu machine,
I created a bunch of different types of interfaces and configured IPv4 and IPv6 addresses on them.
The goal of this post is to show what code needed to be written and which changes needed to
be made to the plugin, in order to mirror each type of interface from VPP into a valid Linux interface.
As we'll see, marrying the Linux network interface approach with the VPP interface approach can be
tricky! Throughout this post, the vanilla Ubuntu machine will keep the following configuration, the
config file of which you can see in the Appendix:
| Name | type | Addresses
|-----------------|------|----------
| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64
| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64
| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64
| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64
| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64
This configuration will allow me to ensure that all common types of sub-interface are supported
by the plugin.
### Startingpoint
The `linux-cp` plugin that ships with VPP 21.06, when initialized with the desired startup config
(see Appendix), will yield this (Hippo is the machine that runs my development branch of VPP; it's
called that because it's always hungry for packets):
```
pim@hippo:~/src/lcpng$ ip ro
default via 194.1.163.65 dev enp6s0 proto static
10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1
10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1
10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1
194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88
pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2
10.0.1.2 is alive
10.0.2.2 is alive
10.0.3.2 is unreachable
10.0.4.2 is unreachable
10.0.5.2 is unreachable
pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \
2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2
2001:db8:0:1::2 is alive
2001:db8:0:2::2 is alive
2001:db8:0:3::2 is unreachable
2001:db8:0:4::2 is unreachable
2001:db8:0:5::2 is unreachable
```
Yikes! So the plugin really only knows how to handle untagged interfaces, and sub-interfaces with
one dot1q VLAN tag. The other three scenarios (dot1ad VLAN tag; dot1q in dot1q; and dot1q in dot1ad)
are not ok. And, curiously, the `dot1ad 2345 exact-match` interface _was_ created (as linux interface
`e0.1236`), but it doesn't ping, and I'll show you why :-) But principally: let's fix this plugin!
### Anatomy of Linux Interface Pairs
In VPP, the plumbing to the Linux kernel is done via a TUN/TAP interface. For L3 interfaces, TAP is
used. This TAP appears in the Linux network namespace as a device with which you can interact. From
the Linux point of view, on egress, all packets coming from the host into the TAP are cross-connected
directly to the logical VPP network interface. In VPP, on ingress, packets destined for an L3 address
on any VPP interface, as well as packets that are multicast, are punted into the TAP, which makes
them appear in the kernel.
In VPP, a linux interface pair (`LIP` for short) is therefore a tuple `{ vpp_phy_idx, vpp_tap_idx, netlink_idx }`.
Creating one of these, is the art of first creating a tap, and associating it with the `vpp_phy`,
copying traffic from it into the dataplane, and punting traffic from the dataplane into the TAP
so that Linux can see it. The plugin exposes an API endpoint that creates, deletes and lists
these linux interface pairs:
```
lcp create <sw_if_index>|<if-name> host-if <host-if-name> netns <namespace> [tun]
lcp delete <sw_if_index>|<if-name>
show lcp [phy <interface>]
```
If you're still with me, congratulations, because this is where it starts to get fun!
### Create interface: physical
The easiest interface type is a physical one. Here, the plugin will create a TAP, copy the MAC
address from the PHY, and set a bunch of attributes on the TAP, such as MTU and link state.
Here, I made my first set of changes (in [[patchset 3](https://gerrit.fd.io/r/c/vpp/+/33481/2..3)])
to the plugin:
* Initialize the link state of the VPP interface, not unconditionally set it to 'down'.
* Initialize the MTU of the VPP interface into the TAP, do not assume it is the VPP default
of 9000; if the MTU is not known, assume the TAP has 9216, the largest possible on ethernet.
Taking a look:
```
DBGvpp# show int TenGigabitEthernet3/0/0
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
TenGigabitEthernet3/0/0 1 down 9000/0/0/0
DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0
DBGvpp# show tap tap1
Interface: tap1 (ifindex 7)
name "e0"
host-ns "(nil)"
host-mtu-size "9000"
host-mac-addr: 68:05:ca:32:46:14
...
DBGvpp# set interface state TenGigabitEthernet3/0/1 up
DBGvpp# set interface mtu packet 1500 TenGigabitEthernet3/0/1
DBGvpp# lcp create TenGigabitEthernet3/0/1 host-if e1
```
And in Linux, unceremoniously, both interfaces appear:
```
pim@hippo:~/src/lcpng$ ip link show e0
291: e0: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff
pim@hippo:~/src/lcpng$ ip link show e1
307: e1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 68:05:ca:32:46:15 brd ff:ff:ff:ff:ff:ff
```
The MAC address from the physical interface `show hardware-interface TenGigabitEthernet3/0/0`
corresponds to the one seen in the TAP, and the one seen in the Linux interface we just created.
The Linux interfaces respect the MTU and link state of their counterpart VPP interfaces (`e0` is
down at 9000b, `e1` is up at 1500b).
### Create interface: dot1q
Note that creating an ethernet sub-interface in VPP takes the following form:
```
create sub-interfaces <interface> {<subId> [default|untagged]} | {<subId>-<subId>}
| {<subId> dot1q|dot1ad <vlanId>|any [inner-dot1q <vlanId>|any] [exact-match]}
```
Here, I'll start with the simplest form, canonically called a .1q VLAN or a _tagged_ interface.
The plugin handles it just fine, with a codepath that first creates a sub-interface on the parent's
TAP, forwards traffic to/from the VPP subinterface into the parent TAP, asks the Linux kernel to
create a new interface of type vlan with the `id` set to the dot1q tag, as a child of the `e0`
interface. Note however the `exact-match` keyword, which is very important. In VPP, without setting
exact-match, any ethernet frame that matches the sub-interface expression, will be handled by it.
This means the VLAN with tag 1234, but also a stacked (Q-in-Q or Q-in-AD) VLAN with the outer tag
set to 1234 will match. This is nonsensical for an IP interface, and as such the first two examples
will successfully create, but the third example will crash the plugin:
```
## Good, shorthand sets exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1234
DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
## Good, explicitly set exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
## Bad, will crash
DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234
DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
```
The reason is that the first call is a shorthand: it creates sub-int 1234 as `dot1q 1234 exact-match`,
which is literally what the second example does, while the third example creates a non-exact-match sub-int
1234 with `dot1q 1234`. So I changed the behavior to explicitly reject sub-interfaces that are not exact-match
in [[patchset 4](https://gerrit.fd.io/r/c/vpp/+/33481/3..4)]. Actually, it turns out that VPP upstream also
crashes on setting an ip address on a sub-int that is not configured with `exact-match`, so I fixed that
upstream in this [[gerrit](https://gerrit.fd.io/r/c/vpp/+/33444)] too.
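The matching behavior described above can be sketched in a few lines of Python (a simplification for illustration, not the plugin's actual C code; a frame's VLAN stack is modelled as a tuple of tag IDs, outermost first):

```python
def subint_matches(frame_tags, subint_tags, exact_match):
    """Decide whether a frame's VLAN tag stack is handled by a sub-interface.

    Without exact-match, a frame matches as soon as its outermost tags equal
    the sub-interface's tags -- so 'dot1q 1234' also attracts 1234-in-anything.
    With exact-match, the whole tag stack must be identical.
    """
    if exact_match:
        return frame_tags == subint_tags
    return frame_tags[: len(subint_tags)] == subint_tags

# 'dot1q 1234' without exact-match also matches a stacked 1234/1000 frame:
assert subint_matches((1234, 1000), (1234,), exact_match=False)
# ... which is exactly what an L3 (IP) sub-interface must not do:
assert not subint_matches((1234, 1000), (1234,), exact_match=True)
assert subint_matches((1234,), (1234,), exact_match=True)
```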
### Create interface: dot1ad
While by far 802.1q _VLAN_ interfaces are the most used, there's a lesser known sibling called
802.1ad -- the only difference is that VLAN ethernet frames with .1q use the well known 0x8100
ethernet type (called a tag protocol identifier, or [TPID](https://en.wikipedia.org/wiki/IEEE_802.1Q)), while .1ad uses a
lesser known 0x88a8 type. Originally, Q-in-Q was intended to use the 0x88a8 tag
for the outer type and 0x8100 for the inner type, differentiating the two. But the industry
was conflicted, and many vendors chose to use 0x8100 for both inner and outer types; VPP supports
this and so does Linux, so let's implement it in [[patchset 5](https://gerrit.fd.io/r/c/vpp/+/33481/4..5)].
Without this change, the plugin would create the interface, but it would invariably create it as
.1q on the linux side, which explains why the `e0.1236` interface exists but doesn't ping in
my _startingpoint_ above. Now we have the expected behavior:
```
DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
pim@hippo:~/src/lcpng$ ping 10.0.4.2
PING 10.0.4.2 (10.0.4.2) 56(84) bytes of data.
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=0.58 ms
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.57 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.62 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.67 ms
^C
--- 10.0.4.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 0.566/0.608/0.672/0.041 ms
```
### Create interface: dot1q in dot1ad
This is the original Q-in-Q as it was intended. Frames here carry an outer ethernet TPID of 0x88a8
(dot1ad) which is followed by an inner ethernet TPID of 0x8100 (dot1q). Of course, untagged inner
frames are also possible - they show up as simply one ethernet TPID of dot1ad followed directly by
the L3 payload. Here, things get a bit more tricky. On the VPP side, we can simply create the
sub-interface directly; but on the Linux side, we cannot do that. This is because in VPP,
all sub-interfaces are directly parented by their physical interface, while in Linux, the
interfaces are stacked on one another. Compare:
```
### VPP idiomatic q-in-ad (1 interface)
DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
### Linux idiomatic q-in-ad stack (2 interfaces)
ip link add link e0 name e0.2345 type vlan id 2345 proto 802.1ad
ip link add link e0.2345 name e0.2345.1000 type vlan id 1000 proto 802.1q
```
So in order to create Q-in-AD sub-interfaces, for Linux their intermediary parent must exist, while
in VPP this is not necessary. I have to make a compromise, so I'll be a bit more explicit and allow
this type of _LIP_ to be created only under these conditions:
* A sub-int exists with the intermediary (in this case, `dot1ad 2345 exact-match`)
* That sub-int itself has a _LIP_, with a Linux interface device that we can spawn the inner interface off of
If these conditions don't hold, I reject the request. If they do, I create an interface pair:
```
DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237
pim@hippo:~/src/lcpng$ ip link show e0.1236
375: e0.1236@e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff
pim@hippo:~/src/lcpng$ ip link show e0.1237
376: e0.1237@e0.1236: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff
```
Here, `e0.1237` was indeed created as a child of `e0.1236`, which in turn was created as a child of
`e0`.
The code for this is in [[patchset 6](https://gerrit.fd.io/r/c/vpp/+/33481/5..6)].
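The two preconditions can be sketched as follows (a hedged Python simplification, not the plugin's C implementation; `subints` is the set of sub-interface indices known to VPP, and `lips` is an assumed mapping from a VPP sw_if_index to its Linux host interface name):

```python
def can_create_qinq_lip(parent_sw_if_index, subints, lips):
    """Only allow a LIP on a Q-in-Q/Q-in-AD sub-interface if its intermediary
    (outer-tag) sub-interface exists in VPP *and* already has a LIP, so the
    Linux child interface has a parent device to hang off of."""
    if parent_sw_if_index not in subints:
        return False   # no intermediary sub-interface exists in VPP
    if parent_sw_if_index not in lips:
        return False   # intermediary has no Linux counterpart yet
    return True

subints = {10}              # e.g. 'dot1ad 2345 exact-match' as sw_if_index 10
lips = {10: "e0.1236"}      # ... which already has a LIP
assert can_create_qinq_lip(10, subints, lips)
assert not can_create_qinq_lip(10, subints, {})  # reject: parent has no LIP
```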
### Create interface: dot1q in dot1q
Given the change above, this is an entirely obvious capability that the plugin now handles, but I did
find a failure mode: creating a _LIP_ for a sub-interface when no _LIPs_ exist at all
causes a NULL deref when trying to look up the _LIP_ of the parent (which doesn't yet have a _LIP_
defined). I fixed that in [[patchset 7](https://gerrit.fd.io/r/c/vpp/+/33481/6..7)].
## Results
After applying the configuration to VPP (in Appendix), here's the results:
```
pim@hippo:~/src/lcpng$ ip ro
default via 194.1.163.65 dev enp6s0 proto static
10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1
10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1
10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1
10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1
10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1
194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88
pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2
10.0.1.2 is alive
10.0.2.2 is alive
10.0.3.2 is alive
10.0.4.2 is alive
10.0.5.2 is alive
pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \
2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2
2001:db8:0:1::2 is alive
2001:db8:0:2::2 is alive
2001:db8:0:3::2 is alive
2001:db8:0:4::2 is alive
2001:db8:0:5::2 is alive
```
As can be seen, all interface types ping. Mirroring interfaces from VPP to Linux is now done!
We still have to manually copy the configuration (like link states, MTU changes, IP addresses and
routes) from VPP into Linux. Of course, it would be great if the plugin mirrored those changes
automatically, and that is exactly the topic of my next post.
## Credits
I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks
and that my work stands on their shoulders. I've had a little bit of help along the way from Neale
Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work!
## Appendix
#### Ubuntu config
```
# Untagged interface
ip addr add 10.0.1.2/30 dev enp66s0f0
ip addr add 2001:db8:0:1::2/64 dev enp66s0f0
ip link set enp66s0f0 up mtu 9000
# Single 802.1q tag 1234
ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234
ip link set enp66s0f0.q up mtu 9000
ip addr add 10.0.2.2/30 dev enp66s0f0.q
ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000
ip link set enp66s0f0.qinq up mtu 9000
ip addr add 10.0.3.2/30 dev enp66s0f0.qinq
ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq
# Single 802.1ad tag 2345
ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad
ip link set enp66s0f0.ad up mtu 9000
ip addr add 10.0.4.2/30 dev enp66s0f0.ad
ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q
ip link set enp66s0f0.qinad up mtu 9000
ip addr add 10.0.5.2/30 dev enp66s0f0.qinad
ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad
```
#### VPP config
```
vppctl set interface state TenGigabitEthernet3/0/0 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0
vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
ip link set e0 up mtu 9000
ip addr add 10.0.1.1/30 dev e0
ip addr add 2001:db8:0:1::1/64 dev e0
vppctl create sub TenGigabitEthernet3/0/0 1234
vppctl set interface state TenGigabitEthernet3/0/0.1234 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64
vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
ip link set e0.1234 up mtu 9000
ip addr add 10.0.2.1/30 dev e0.1234
ip addr add 2001:db8:0:2::1/64 dev e0.1234
vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1235 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64
vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235
ip link set e0.1235 up mtu 9000
ip addr add 10.0.3.1/30 dev e0.1235
ip addr add 2001:db8:0:3::1/64 dev e0.1235
vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1236 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64
vppctl lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
ip link set e0.1236 up mtu 9000
ip addr add 10.0.4.1/30 dev e0.1236
ip addr add 2001:db8:0:4::1/64 dev e0.1236
vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1237 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64
vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237
ip link set e0.1237 up mtu 9000
ip addr add 10.0.5.1/30 dev e0.1237
ip addr add 2001:db8:0:5::1/64 dev e0.1237
```

---
custom_css:
- /assets/vpp/asciinema-player.css
date: "2021-08-13T15:33:14Z"
title: VPP Linux CP - Part2
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
>100Mpps and >100Gbps forwarding rates will be well within reach!
In this second post, let's make the plugin a bit more useful by having it copy forward state changes
made to VPP interfaces into their Linux CP counterparts.
## My test setup
I'm using the same setup from the [previous post]({% post_url 2021-08-12-vpp-1 %}). The goal of this
post is to show what code needed to be written and which changes needed to be made to the plugin, in
order to propagate changes to VPP interfaces to the Linux TAP devices.
### Starting point
The `linux-cp` plugin that ships with VPP 21.06, even with my [changes](https://gerrit.fd.io/r/c/vpp/+/33481)
is still _only_ able to create _LIP_ devices. It's not very user friendly to have to
apply state changes meticulously on both sides, but it can be done:
```
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
vppctl set interface state TenGigabitEthernet3/0/0 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0
vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64
ip link set e0 up
ip link set e0 mtu 9000
ip addr add 10.0.1.1/30 dev e0
ip addr add 2001:db8:0:1::1/64 dev e0
```
In this snippet, we can see that after creating the _LIP_, thus conjuring up the unconfigured
`e0` interface in Linux, I changed the VPP interface in three ways:
1. I set the state of the VPP interface to 'up'
1. I set the MTU of the VPP interface to 9000
1. I add an IPv4 and IPv6 address to the interface
Because state does not (yet) propagate, I have to make those changes as well on the Linux side
with the subsequent `ip` commands.
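Until the plugin syncs state itself, the duplication above can at least be scripted. As a hypothetical sketch (the helper name and shape are mine, not part of the plugin), one declaration can emit both sides' commands:

```python
def config_pair(vpp_if: str, host_if: str, mtu: int, addrs: list[str]) -> list[str]:
    """Emit the paired vppctl and ip commands that keep VPP and Linux identical."""
    cmds = [
        f"vppctl set interface state {vpp_if} up",
        f"vppctl set interface mtu packet {mtu} {vpp_if}",
        f"ip link set {host_if} up mtu {mtu}",
    ]
    for addr in addrs:
        # Every address must be applied twice: once in the dataplane, once in Linux
        cmds.append(f"vppctl set interface ip address {vpp_if} {addr}")
        cmds.append(f"ip addr add {addr} dev {host_if}")
    return cmds

for cmd in config_pair("TenGigabitEthernet3/0/0", "e0", 9000,
                       ["10.0.1.1/30", "2001:db8:0:1::1/64"]):
    print(cmd)
```

This is exactly the kind of toil the rest of this post makes unnecessary.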
### Configuration
I can imagine that operators may want more control, and prefer to make the Linux and VPP changes
themselves. This is why I'll start off by adding a variable called `lcp_sync`, along with a
startup configuration keyword and a CLI setter. This allows me to turn the whole sync behavior on
and off, for example in `startup.conf`:
```
linux-cp {
default netns dataplane
lcp-sync
}
```
And in the CLI:
```
DBGvpp# show lcp
lcp default netns dataplane
lcp lcp-sync on
DBGvpp# lcp lcp-sync off
DBGvpp# show lcp
lcp default netns dataplane
lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
### Change interface: state
Immediately, I find a dissonance between VPP and Linux: When Linux sets a parent interface down,
all children go to state `M-DOWN`. When Linux sets a parent interface up, all of its children
automatically go to state `UP` and `LOWER_UP`. To illustrate:
```
ip link set enp66s0f1 down
ip link add link enp66s0f1 name foo type vlan id 1234
ip link set foo down
## Both interfaces are down, which makes sense because I set them both down
ip link | grep enp66s0f1
9: enp66s0f1: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000
61: foo@enp66s0f1: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
ip link set enp66s0f1 up
ip link | grep enp66s0f1
## Both interfaces are up, which doesn't make sense because I only changed one of them!
9: enp66s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
61: foo@enp66s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```
VPP does not work this way. In VPP, the admin state of each interface is individually
controllable, so it's possible to bring up the parent while leaving the sub-interface in
the state it was. I did notice that you can't bring up a sub-interface if its parent
is down, which I found counterintuitive, but that's neither here nor there.
All of this is to say that we have to be careful when copying state forward, because as
this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its
sub-interfaces** up unconditionally; VPP will have the interface up and its sub-interfaces in
whatever state they were before.
To address this, a second
[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interface states are exactly the same between Linux and VPP.
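The corrective approach can be modeled in a few lines. This is a toy sketch of my own (not the plugin's actual C code): after a parent comes up in Linux, which forcibly drags all children up, a second pass pushes each child back to its VPP admin state:

```python
def sync_parent_up(linux_state: dict, vpp_state: dict, parent: str) -> None:
    """Bring `parent` up in Linux, then undo Linux's automatic child-up behavior."""
    linux_state[parent] = "up"
    for ifname in linux_state:
        if ifname.startswith(parent + "."):
            linux_state[ifname] = "up"          # Linux brings children up with the parent
    # Corrective pass: restore each sub-interface to its VPP admin state
    for ifname in linux_state:
        if ifname.startswith(parent + "."):
            linux_state[ifname] = vpp_state[ifname]
```

The end result mirrors the CLI session below: the parent comes up, but a sub-interface that was administratively down in VPP stays down in Linux.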
```
DBGvpp# create sub TenGigabitEthernet3/0/0 10
DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0
DBGvpp# lcp create TenGigabitEthernet3/0/0.10 host-if e0.10
DBGvpp# set int state TenGigabitEthernet3/0/0 up
## Correct: parent is up, sub-int is not
694: e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
695: e0.10@e0: <BROADCAST,MULTICAST> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
DBGvpp# set int state TenGigabitEthernet3/0/0.10 up
## Correct: both interfaces up
694: e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
695: e0.10@e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
DBGvpp# set int state TenGigabitEthernet3/0/0 down
DBGvpp# set int state TenGigabitEthernet3/0/0.10 down
DBGvpp# set int state TenGigabitEthernet3/0/0 up
## Correct: only the parent is up
694: e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
695: e0.10@e0: <BROADCAST,MULTICAST> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
```
### Change interface: MTU
Finally, a straightforward
[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is a
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,
so the following remains problematic:
```
DBGvpp# create sub TenGigabitEthernet3/0/0 10
DBGvpp# set int mtu packet 1500 TenGigabitEthernet3/0/0
DBGvpp# set int mtu packet 9000 TenGigabitEthernet3/0/0.10
## Incorrect: sub-int has larger MTU than parent, valid in VPP, not in Linux
694: e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
695: e0.10@e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```
I think the best way to ensure this works is to _clamp_ the sub-int to a maximum MTU of
that of its parent, and revert the user's request to change the VPP sub-int to anything
higher than that, perhaps logging an error explaining why. This means two things:
1. Any change in VPP of a child MTU to larger than its parent, must be reverted.
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.
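Both rules reduce to a simple invariant: no child MTU may exceed its parent's. A minimal sketch of that logic (function names are mine, for illustration only):

```python
def validate_child_mtu(parent_mtu: int, requested: int) -> int:
    """Rule 1: a child MTU request larger than the parent is reverted (clamped)."""
    return min(requested, parent_mtu)

def clamp_child_mtus(parent_mtu: int, child_mtus: dict) -> dict:
    """Rule 2: after a parent MTU change, re-clamp every child to at most the parent."""
    return {name: min(mtu, parent_mtu) for name, mtu in child_mtus.items()}
```

In the problematic session above, `validate_child_mtu(1500, 9000)` would keep the sub-interface at 1500 instead of letting VPP and Linux disagree.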
I addressed the issue in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
### Change interface: IP Addresses
There are three scenarios in which IP addresses will need to be copied from
VPP into the companion Linux devices:
1. `set interface ip address` adds an IPv4 or IPv6 address. This is handled by
`lcp_itf_ip[46]_add_del_interface_addr()` which is a callback installed in
`lcp_itf_pair_init()` at plugin initialization time.
1. `set interface ip address del` removes addresses. This is also handled by
`lcp_itf_ip[46]_add_del_interface_addr()` but curiously there is no
upstream `vnet_netlink_del_ip[46]_addr()` so I had to write them inline here.
I will try to get them upstreamed, as they appear to be obvious companions
in `vnet/device/netlink.h`.
1. This one is easy to overlook, but upon _LIP_ creation, it could be that there
are already L3 addresses present on the VPP interface. If so, set them in the
_LIP_ with `lcp_itf_set_interface_addr()`.
This means with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.
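All three scenarios amount to reconciling two address sets: whatever VPP holds is authoritative, and Linux is adjusted to match. A hypothetical set-difference sketch (my own illustration, not the plugin's code):

```python
def addr_delta(vpp_addrs: set, linux_addrs: set) -> tuple:
    """Return (to_add, to_del): the netlink adds and removes needed so that
    the Linux interface carries exactly the addresses configured in VPP."""
    to_add = vpp_addrs - linux_addrs   # present in VPP, missing in Linux
    to_del = linux_addrs - vpp_addrs   # stale in Linux, gone from VPP
    return to_add, to_del
```

At _LIP_ creation time `linux_addrs` is empty, so the delta degenerates to "add everything", which is scenario three.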
### Further work
I noticed that [Bird](https://bird.network.cz/) periodically scans the Linux
interface list and (re)learns information from them. I have a suspicion that
such a feature might be useful in the VPP plugin as well: I can imagine a
periodical process that walks over the _LIP_ interface list, and compares
what it finds in Linux with what is configured in VPP. What's not entirely
clear to me is which direction should 'trump', that is, should the Linux
state be forced into VPP, or should the VPP state be forced into Linux? I
don't yet have a good feeling of the answer, so I'll punt on that for now.
## Results
After applying the configuration to VPP (in Appendix), here's the results:
```
pim@hippo:~/src/lcpng$ ip ro
default via 194.1.163.65 dev enp6s0 proto static
10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1
10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1
10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1
10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1
10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1
194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88
pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2
10.0.1.2 is alive
10.0.2.2 is alive
10.0.3.2 is alive
10.0.4.2 is alive
10.0.5.2 is alive
pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \
2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2
2001:db8:0:1::2 is alive
2001:db8:0:2::2 is alive
2001:db8:0:3::2 is alive
2001:db8:0:4::2 is alive
2001:db8:0:5::2 is alive
```
In case you were wondering if my previous post ended in the same huzzah moment: it did.
The difference is that now the VPP configuration is _much shorter_! Comparing
the Appendix from this post with my [first post]({% post_url 2021-08-12-vpp-1 %}), after
all of this work I no longer have to manually copy the configuration (like link states,
MTU changes, IP addresses) from VPP into Linux, instead the plugin does all of this work
for me, and I can configure both sides entirely with `vppctl` commands!
### Bonus screencast!
Humor me as I [take the code out](https://asciinema.org/a/430411) for a five-minute spin :-)
<script id="asciicast-430411" src="https://asciinema.org/a/430411.js" async></script>
## Credits
I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks
and that my work stands on their shoulders. I've had a little bit of help along the way from Neale
Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work!
## Appendix
#### Ubuntu config
```
# Untagged interface
ip addr add 10.0.1.2/30 dev enp66s0f0
ip addr add 2001:db8:0:1::2/64 dev enp66s0f0
ip link set enp66s0f0 up mtu 9000
# Single 802.1q tag 1234
ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234
ip link set enp66s0f0.q up mtu 9000
ip addr add 10.0.2.2/30 dev enp66s0f0.q
ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000
ip link set enp66s0f0.qinq up mtu 9000
ip addr add 10.0.3.2/30 dev enp66s0f0.qinq
ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq
# Single 802.1ad tag 2345
ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad
ip link set enp66s0f0.ad up mtu 9000
ip addr add 10.0.4.2/30 dev enp66s0f0.ad
ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q
ip link set enp66s0f0.qinad up mtu 9000
ip addr add 10.0.5.2/30 dev enp66s0f0.qinad
ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad
```
#### VPP config
```
## Look mom, no `ip` commands!! :-)
vppctl set interface state TenGigabitEthernet3/0/0 up
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0
vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64
vppctl create sub TenGigabitEthernet3/0/0 1234
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234
vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
vppctl set interface state TenGigabitEthernet3/0/0.1234 up
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64
vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1235 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235
vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64
vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1236 up
vppctl lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64
vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1237 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64
vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237
```
#### Final note
You may have noticed that the [commit] links are all git commits in my private working copy. I want
to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and submitted
before piling on more changes. Feel free to contact vpp-dev@ for more information in the meantime
:-)
---
date: "2021-08-15T11:13:14Z"
title: VPP Linux CP - Part3
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing is the higher-level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
>100Mpps and >100Gbps forwarding rates will be well within reach!
In this third post, I'll be adding a convenience feature that I think will be popular: the plugin
will now automatically create or delete _LIPs_ for sub-interfaces where-ever the parent has a _LIP_
configured.
## My test setup
I've extended the setup from the [first post]({% post_url 2021-08-12-vpp-1 %}). The base
configuration for the `enp66s0f0` interface remains exactly the same, but I've also added
an LACP `bond0` interface, which also has the whole kit and caboodle of sub-interfaces defined, see
below in the Appendix for details, but here's the table again for reference:
| Name | type | Addresses
|-----------------|------|----------
| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64
| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64
| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64
| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64
| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64
| bond0 | untagged | 10.1.1.2/30 2001:db8:1:1::2/64
| bond0.q | dot1q 1234 | 10.1.2.2/30 2001:db8:1:2::2/64
| bond0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.1.3.2/30 2001:db8:1:3::2/64
| bond0.ad | dot1ad 2345 | 10.1.4.2/30 2001:db8:1:4::2/64
| bond0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.1.5.2/30 2001:db8:1:5::2/64
The goal of this post is to show what code needed to be written and which changes needed to be
made to the plugin, in order to automatically create and delete sub-interfaces.
### Starting point
Based on the state of the plugin after the [second post]({% post_url 2021-08-13-vpp-2 %}),
operators must create _LIP_ instances for interfaces as well as each sub-interface
explicitly:
```
DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0
DBGvpp# create sub TenGigabitEthernet3/0/0 1234
DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235
DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237
```
But one might ask -- is it really useful to have L3 interfaces in VPP without a companion interface
in an appropriate Linux namespace? I think the answer might be 'yes' for individual interfaces
(for example, in a mgmt VRF that has no need to run routing protocols), but I also think the answer
is probably 'no' for sub-interfaces, once their parent has a _LIP_ defined.
### Configuration
The original plugin (the one that ships with VPP 21.06) has a configuration flag that seems
promising by defining a flag `interface-auto-create`, but its implementation was never finished.
I've removed that flag and replaced it with a new one. The main reason for this decision is
that there are actually two kinds of auto configuration: the first one is detailed in this post,
but in the future, I will also make it possible to create VPP interfaces by creating their Linux
counterpart (eg. `ip link add link e0 name e0.1234 type vlan id 1234` with a configuration statement
that might be called `netlink-auto-subint`), and I'd like for the plugin to individually
enable/disable both types. Also, I find the name unfortunate, as the feature should create
_and delete LIPs_ on sub-interfaces, not just create them. So out with the old, in with the new :)
I have to acknowledge that not everybody will want automagically created interfaces, similar to
the original configuration, so I define a new configuration flag called `lcp-auto-subint` which goes
into the `linux-cp` module configuration stanza in VPP's `startup.conf`, which might look a little
like this:
```
linux-cp {
default netns dataplane
lcp-auto-subint
}
```
Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
```
DBGvpp# show lcp
lcp default netns dataplane
lcp lcp-auto-subint on
lcp lcp-sync off
DBGvpp# lcp lcp-auto-subint off
DBGvpp# show lcp
lcp default netns dataplane
lcp lcp-auto-subint off
lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
The code for the configuration toggle is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Auto create/delete sub-interfaces
The original plugin code (that ships with VPP 21.06) made a start by defining a function called
`lcp_itf_phy_add()` and registering an intent with `VNET_SW_INTERFACE_ADD_DEL_FUNCTION()`. I've
moved the function to the source file I created in [Part 2]({% post_url 2021-08-13-vpp-2 %})
(called `lcp_if_sync.c`), specifically to handle interface syncing, and gave it a name that
matches the VPP callback, so `lcp_itf_interface_add_del()`.
The logic of that function is pretty straight forward. I want to only continue if `lcp-auto-subint`
is set, and I only want to create or delete sub-interfaces, not parents. This way, the operator
can decide on a per-interface basis if they want it to participate in Linux (eg, issuing
`lcp create BondEthernet0 host-if be0`). After I've established that (a) the caller wants
auto-creation/auto-deletion, and (b) we're fielding a callback for a sub-int, all I must do is:
* On creation: does the parent interface `sw->sup_sw_if_index` have a `LIP`? If yes, let's
create a `LIP` for this sub-interface, too. We determine that Linux interface name by taking
the parent name (say, `be0`), and sticking the sub-int number after it, like `be0.1234`.
* On deletion: does this sub-interface we're fielding the callback for have a `LIP`? If yes,
then delete it.
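That decision logic and the naming convention are small enough to sketch in full. This is an illustrative model of mine, not the plugin's C implementation:

```python
def should_autocreate(lcp_auto_subint: bool, is_subint: bool, parent_has_lip: bool) -> bool:
    """Only act when the feature is on, the callback is for a sub-interface,
    and the operator opted the parent in by giving it a LIP."""
    return lcp_auto_subint and is_subint and parent_has_lip

def auto_subint_name(parent_host_if: str, sub_id: int) -> str:
    """Linux name for an auto-created sub-interface LIP: parent name, dot, sub id."""
    return f"{parent_host_if}.{sub_id}"
```

So `lcp create BondEthernet0 host-if be0` followed by `create sub BondEthernet0 1234` yields a Linux device named `be0.1234`, with no further operator input.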
I noticed that interface deletion had a bug (one that I fell victim to as well: it does not
remove the netlink device in the correct network namespace), which I fixed.
The code for the auto create/delete and the bugfix is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Further Work
One other thing I noticed (and this is actually a bug!) is that on `BondEthernet`
interfaces, upon creation a temporary MAC is assigned, which is subsequently
overwritten by the first physical interface that is added to the bond, which means
that when a _LIP_ is created _before_ the first interface is added, its MAC
will be the temporary MAC. Compare:
```
vppctl create bond mode lacp load-balance l2
vppctl lcp create BondEthernet0 host-if be0
## MAC of be0 is now a temp/ephemeral MAC
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3
## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2
## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2)
```
In such a situation, `be0` will not be reachable unless it's manually set to the correct MAC.
I looked around but found no callback or event handler for MAC address changes in VPP -- so I
should probably add one, but in the meantime, I'll just add interfaces to the bond before
creating the _LIP_, like so:
```
vppctl create bond mode lacp load-balance l2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2
## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2
vppctl lcp create BondEthernet0 host-if be0
## MAC of be0 is now that of BondEthernet0
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3
## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2)
```
.. which is an adequate workaround for now.
## Results
After this code is in, the operator will only have to create a LIP for the main interfaces, and
the plugin will take care of the rest!
```
pim@hippo:~/src/lcpng$ grep 'create' config3.sh
vppctl lcp lcp-auto-subint on
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
vppctl create sub TenGigabitEthernet3/0/0 1234
vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
vppctl create bond mode lacp load-balance l2
vppctl lcp create BondEthernet0 host-if be0
vppctl create sub BondEthernet0 1234
vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match
vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
```
And as an end-to-end functional validation, now extended as well to ping the Ubuntu machine over
the LACP interface and all of its subinterfaces, works like a charm:
```
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip link | grep e0
1063: e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
1064: be0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
209: e0.1234@e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
210: e0.1235@e0.1234: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
211: e0.1236@e0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
212: e0.1237@e0.1236: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
213: be0.1234@be0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
214: be0.1235@be0.1234: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
215: be0.1236@be0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
216: be0.1237@be0.1236: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
## The TenGigabitEthernet3/0/0 (e0) interfaces
pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2
10.0.1.2 is alive
10.0.2.2 is alive
10.0.3.2 is alive
10.0.4.2 is alive
10.0.5.2 is alive
pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \
2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2
2001:db8:0:1::2 is alive
2001:db8:0:2::2 is alive
2001:db8:0:3::2 is alive
2001:db8:0:4::2 is alive
2001:db8:0:5::2 is alive
## The BondEthernet0 (be0) interfaces
pim@hippo:~/src/lcpng$ fping 10.1.1.2 10.1.2.2 10.1.3.2 10.1.4.2 10.1.5.2
10.1.1.2 is alive
10.1.2.2 is alive
10.1.3.2 is alive
10.1.4.2 is alive
10.1.5.2 is alive
pim@hippo:~/src/lcpng$ fping6 2001:db8:1:1::2 2001:db8:1:2::2 \
2001:db8:1:3::2 2001:db8:1:4::2 2001:db8:1:5::2
2001:db8:1:1::2 is alive
2001:db8:1:2::2 is alive
2001:db8:1:3::2 is alive
2001:db8:1:4::2 is alive
2001:db8:1:5::2 is alive
```
## Credits
I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks
and that my work stands on their shoulders. I've had a little bit of help along the way from Neale
Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work!
## Appendix
#### Ubuntu config
```
# Untagged interface
ip addr add 10.0.1.2/30 dev enp66s0f0
ip addr add 2001:db8:0:1::2/64 dev enp66s0f0
ip link set enp66s0f0 up mtu 9000
# Single 802.1q tag 1234
ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234
ip link set enp66s0f0.q up mtu 9000
ip addr add 10.0.2.2/30 dev enp66s0f0.q
ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000
ip link set enp66s0f0.qinq up mtu 9000
ip addr add 10.0.3.2/30 dev enp66s0f0.qinq
ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq
# Single 802.1ad tag 2345
ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad
ip link set enp66s0f0.ad up mtu 9000
ip addr add 10.0.4.2/30 dev enp66s0f0.ad
ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q
ip link set enp66s0f0.qinad up mtu 9000
ip addr add 10.0.5.2/30 dev enp66s0f0.qinad
ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad
## Bond interface
ip link add bond0 type bond mode 802.3ad
ip link set enp66s0f2 down
ip link set enp66s0f3 down
ip link set enp66s0f2 master bond0
ip link set enp66s0f3 master bond0
ip link set enp66s0f2 up
ip link set enp66s0f3 up
ip link set bond0 up
ip addr add 10.1.1.2/30 dev bond0
ip addr add 2001:db8:1:1::2/64 dev bond0
ip link set bond0 up mtu 9000
# Single 802.1q tag 1234
ip link add link bond0 name bond0.q type vlan id 1234
ip link set bond0.q up mtu 9000
ip addr add 10.1.2.2/30 dev bond0.q
ip addr add 2001:db8:1:2::2/64 dev bond0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link bond0.q name bond0.qinq type vlan id 1000
ip link set bond0.qinq up mtu 9000
ip addr add 10.1.3.2/30 dev bond0.qinq
ip addr add 2001:db8:1:3::2/64 dev bond0.qinq
# Single 802.1ad tag 2345
ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad
ip link set bond0.ad up mtu 9000
ip addr add 10.1.4.2/30 dev bond0.ad
ip addr add 2001:db8:1:4::2/64 dev bond0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q
ip link set bond0.qinad up mtu 9000
ip addr add 10.1.5.2/30 dev bond0.qinad
ip addr add 2001:db8:1:5::2/64 dev bond0.qinad
```
#### VPP config
```
## No more `lcp create` commands for sub-interfaces.
vppctl lcp default netns dataplane
vppctl lcp lcp-auto-subint on
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
vppctl set interface state TenGigabitEthernet3/0/0 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0
vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64
vppctl create sub TenGigabitEthernet3/0/0 1234
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234
vppctl set interface state TenGigabitEthernet3/0/0.1234 up
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64
vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1235 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64
vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1236 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64
vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
vppctl set interface state TenGigabitEthernet3/0/0.1237 up
vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30
vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64
## The LACP bond
vppctl create bond mode lacp load-balance l2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3
vppctl lcp create BondEthernet0 host-if be0
vppctl set interface state TenGigabitEthernet3/0/2 up
vppctl set interface state TenGigabitEthernet3/0/3 up
vppctl set interface state BondEthernet0 up
vppctl set interface mtu packet 9000 BondEthernet0
vppctl set interface ip address BondEthernet0 10.1.1.1/30
vppctl set interface ip address BondEthernet0 2001:db8:1:1::1/64
vppctl create sub BondEthernet0 1234
vppctl set interface mtu packet 9000 BondEthernet0.1234
vppctl set interface state BondEthernet0.1234 up
vppctl set interface ip address BondEthernet0.1234 10.1.2.1/30
vppctl set interface ip address BondEthernet0.1234 2001:db8:1:2::1/64
vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match
vppctl set interface state BondEthernet0.1235 up
vppctl set interface mtu packet 9000 BondEthernet0.1235
vppctl set interface ip address BondEthernet0.1235 10.1.3.1/30
vppctl set interface ip address BondEthernet0.1235 2001:db8:1:3::1/64
vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match
vppctl set interface state BondEthernet0.1236 up
vppctl set interface mtu packet 9000 BondEthernet0.1236
vppctl set interface ip address BondEthernet0.1236 10.1.4.1/30
vppctl set interface ip address BondEthernet0.1236 2001:db8:1:4::1/64
vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
vppctl set interface state BondEthernet0.1237 up
vppctl set interface mtu packet 9000 BondEthernet0.1237
vppctl set interface ip address BondEthernet0.1237 10.1.5.1/30
vppctl set interface ip address BondEthernet0.1237 2001:db8:1:5::1/64
```
#### Final note
You may have noticed that the [commit] links all point to git commits in my private working copy.
I want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
meantime :-)
---
date: "2021-08-25T08:55:14Z"
title: VPP Linux CP - Part4
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
In the first three posts, I added the ability for VPP to synchronize its state (like link state,
MTU, and interface addresses) into Linux. In this post, I'll make a start on the other direction:
allowing changes to interfaces made in Linux to make their way back into VPP!
## My test setup
I'm keeping the setup from the [third post]({% post_url 2021-08-15-vpp-3 %}). A Linux machine has an
interface `enp66s0f0` which has 4 sub-interfaces (one dot1q tagged, one q-in-q, one dot1ad tagged,
and one q-in-ad), giving me five flavors in total. Then, I created an LACP `bond0` interface, which
also has the whole kit and caboodle of sub-interfaces defined, see below in the Appendix for details,
but here's the table again for reference:
| Name | type | Addresses
|-----------------|------|----------
| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64
| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64
| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64
| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64
| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64
| bond0 | untagged | 10.1.1.2/30 2001:db8:1:1::2/64
| bond0.q | dot1q 1234 | 10.1.2.2/30 2001:db8:1:2::2/64
| bond0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.1.3.2/30 2001:db8:1:3::2/64
| bond0.ad | dot1ad 2345 | 10.1.4.2/30 2001:db8:1:4::2/64
| bond0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.1.5.2/30 2001:db8:1:5::2/64
The goal of this post is to show what code needed to be written, and to introduce an entirely
_new plugin_, so that we can separate concerns (and have a higher chance of community acceptance
of the plugins). In the first plugin, now called the **Interface Mirror**, I have previously
implemented the VPP-to-Linux synchronization. In this new plugin (called the **Netlink Listener**)
I implement the Linux-to-VPP synchronization using, _quelle surprise_, Netlink message handlers.
### Starting point
Based on the state of the plugin after the [third post]({% post_url 2021-08-15-vpp-3 %}),
operators can enable `lcp-sync` (which copies changes made in VPP into their Linux counterpart)
and `lcp-auto-subint` (which extends sub-interface creation in VPP to automatically create a
Linux Interface Pair, or _LIP_, and its companion Linux network interface):
```
DBGvpp# lcp lcp-sync on
DBGvpp# lcp lcp-auto-subint on
DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0
DBGvpp# create sub TenGigabitEthernet3/0/0 1234
DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match
pim@hippo:~/src/lcpng$ ip link | grep e0
1286: e0.1234@e0: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1287: e0.1235@e0.1234: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1288: e0.1236@e0: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1289: e0.1237@e0.1236: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1701: e0: <BROADCAST,MULTICAST> mtu 9050 qdisc mq state DOWN mode DEFAULT group default qlen 1000
```
The vision for this plugin has been that Linux can drive most control-plane operations, such as
creating sub-interfaces, adding/removing addresses, changing MTU on links, etc. We can do that by
listening to [Netlink](https://en.wikipedia.org/wiki/Netlink) messages, which were designed for
transferring miscellaneous networking information between the kernel space and userspace processes
(like `VPP`). Networking utilities, such as the _iproute2_ family and its command line utilities
(like `ip`) use Netlink to communicate with the Linux kernel from userspace.
## Netlink Listener
The first task at hand is to install a Netlink listener. In this new plugin, I first register
`lcp_nl_init()` which adds Linux interface pair (_LIP_) add/del callbacks from the first plugin.
I'm now made aware of new _LIPs_ as they are created.
In `lcb_nl_pair_add_cb()`, I initiate the Netlink listener for the first interface that gets created,
noting its netns. If subsequent adds are in other network namespaces, I just issue a warning. I also
keep a refcount, so I know how many _LIPs_ are bound to this listener.
In `lcb_nl_pair_del_cb()`, I can remove the listener when the last interface pair is removed.
Then, for the listening itself, a Netlink socket is opened, and because Linux can be quite chatty on
Netlink sockets, I raise its read/write buffers to something quite large (typically 64M read
and 16K write size). One note on these sizes: they require a few sysctl values to be set before VPP
starts, typically done as follows:
```
pim@hippo:~/src/vpp$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-Netlink.conf
# Increase Netlink to 64M
net.core.rmem_default=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.wmem_max=67108864
EOF
pim@hippo:~/src/vpp$ sudo sysctl -p
```
After creating the Netlink socket, I add its file descriptor to VPP's built in file handler, which
will see to polling it. On the file handler, I install `lcp_nl_read_cb()` and `lcp_nl_error_cb()`
callbacks which will be invoked when anything interesting happens on the socket:
A bit of explanation on why I use a queue rather than just consuming the Netlink messages directly
as they are offered: I _have to_ use a queue for the common case in which VPP is running single threaded.
Consuming a block of potentially a million route del/adds in one go (say, if BGP is reconverging)
would block VPP from reading new packets from DPDK and, more importantly, new Netlink
messages from the kernel. Those would then fill the 64M socket buffer and overflow it, losing Netlink
messages. That is bad, because it requires an end to end resync of the Linux namespace into the VPP
dataplane, something called an `NLM_F_DUMP`, but that's a story for another day.
So I process only a batch of messages and only for a maximum amount of time per batch. If there are still
some messages left in the queue, I'll just reschedule consumption after M milliseconds. This allows new
Netlink messages to continuously be read from the kernel by VPP's file handler, even if there's a lot of
work to do.
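The scheduling logic can be sketched as follows (Python for brevity, with illustrative names; the actual plugin implements this in C inside a VPP process node):

```python
import collections
import time

def nl_process_msgs(queue, handle, max_count=100, max_msecs=20):
    """Consume up to max_count messages or for up to max_msecs milliseconds,
    whichever comes first. Returns True if messages remain, in which case
    the caller should reschedule another batch instead of blocking."""
    deadline = time.monotonic() + max_msecs / 1000.0
    processed = 0
    while queue and processed < max_count:
        handle(queue.popleft())
        processed += 1
        if time.monotonic() >= deadline:
            break
    return len(queue) > 0

# Simulate a burst of 250 queued Netlink messages, consumed in batches.
queue = collections.deque(range(250))
seen = []
more = True
while more:
    # In VPP this "loop" would be a timed reschedule of the process node,
    # letting the file handler keep reading from the kernel in between.
    more = nl_process_msgs(queue, seen.append)
```
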
* `lcp_nl_read_cb()` calls `lcp_nl_callback()` which pushes Netlink messages onto a queue and
issues a `NL_EVENT_READ` event, any socket read error issues `NL_EVENT_READ_ERR` event.
* `lcp_nl_error_cb()` simply issues `NL_EVENT_READ_ERR` event and moves on with life.
To capture these events, I initialize a process node called `lcp_nl_process()`, which handles:
* `NL_EVENT_READ` by calling `lcp_nl_process_msgs()` and processing a batch of messages (either
a maximum count, or a maximum duration, whichever is reached first).
* `NL_EVENT_READ_ERR` is the other event that can happen, in case VPP's file handler or my own
`lcp_nl_read_cb()` encounter a read error. All it does is close and reopen the Netlink socket
in the same network namespace we were before, in an attempt to minimize the damage, _dazed and
confused, but trying to continue_.
Alright, so at this point, I have a producer queue that gets added to by the Netlink reader
machinery, so all I have to do is consume them. `lcp_nl_process_msgs()` processes up to N messages
and/or for up to M msecs, whichever comes first, and for each individual Netlink message, it
will call `lcp_nl_dispatch()` to handle messages of a given type.
For now, `lcp_nl_dispatch()` just throws the message away after logging it with `format_nl_object()`,
a function that will come in very useful as I start to explore all the different Netlink message types.
The code that forms the basis of our Netlink Listener lives in [[this
commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)]. I
specifically want to call out that I was not the primary author; I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).
### Netlink: Neighbor
ARP and IPv6 Neighbor Discovery will trigger a set of Netlink messages, which are of type
`RTM_NEWNEIGH` and `RTM_DELNEIGH`.
First, I'll add a new source file `lcpng_nl_sync.c` that will house these handler functions.
Their purpose is to take state learned from Netlink messages, and apply that state to VPP.
Then, I add `lcp_nl_neigh_add()` and `lcp_nl_neigh_del()` which implement the following
pattern: Most Netlink messages are somehow about a `link`, which is identified by an
interface index (`ifindex` or just idx for short). That's the same interface index I stored
when I created the _LIP_, calling it `vif_index` because in VPP, it describes a `virtio`
device which implements the IO for the TAP.
If I'm handling a message for link with a given ifindex, I can correlate it with a _LIP_. Not all
messages will be related to something VPP knows or cares about, I'll discuss that more later when
I discuss `RTM_NEWLINK` messages.
If there is no _LIP_ associated with the `ifindex`, then clearly this message is about a
Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert the lladdr
(MAC address) and IP address from the Netlink message into their VPP variants, and then simply
add or remove the ip4/ip6 neighbor adjacency.
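That lookup-then-apply pattern can be sketched like this (Python for brevity; the table contents and return values are illustrative stand-ins, the real handlers are C functions in `lcpng_nl_sync.c`):

```python
# Hypothetical LIP table: Linux ifindex (vif_index) -> VPP interface name.
lip_by_vif = {1488: "TenGigabitEthernet3/0/0"}

# Stand-in for VPP's neighbor adjacency table: (interface, ip) -> lladdr.
vpp_neighbors = {}

def nl_neigh_add(ifindex, ip, lladdr):
    iface = lip_by_vif.get(ifindex)
    if iface is None:
        return "ignored"  # a Linux interface VPP is not aware of
    vpp_neighbors[(iface, ip)] = lladdr
    return "added"

def nl_neigh_del(ifindex, ip):
    iface = lip_by_vif.get(ifindex)
    if iface is None:
        return "ignored"
    vpp_neighbors.pop((iface, ip), None)
    return "deleted"
```
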
The code for this first Netlink message handler lives in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
### Netlink: Address
A decidedly more interesting message is `RTM_NEWADDR` and its deletion companion `RTM_DELADDR`.
It's pretty straightforward to add and remove IPv4 and IPv6 addresses on interfaces. I have
to convert the Netlink representation of an IP address to its VPP counterpart with a helper, add
it or remove it, and if there are no link-local addresses left, disable IPv6 on the interface.
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).
The code for IP address handling is in this
[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:
```
ip addr add 10.0.1.1/30 dev e0
debug linux-cp/nl addr_add: Netlink route/addr: add idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent)
warn linux-cp/nl dispatch: ignored route/route: add family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 }
ping 10.0.1.2
debug linux-cp/nl neigh_add: Netlink route/neigh: add idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000
notice linux-cp/nl neigh_add: Added 10.0.1.2 lladdr 68:05:ca:32:45:94 iface TenGigabitEthernet3/0/0
ip addr del 10.0.1.1/30 dev e0
debug linux-cp/nl addr_del: Netlink route/addr: del idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent)
notice linux-cp/nl addr_del: Deleted 10.0.1.1/30 iface TenGigabitEthernet3/0/0
warn linux-cp/nl dispatch: ignored route/route: del family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 }
debug linux-cp/nl neigh_del: Netlink route/neigh: del idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000
error linux-cp/nl neigh_del: Failed 10.0.1.2 iface TenGigabitEthernet3/0/0
```
It is this very last message that's a bit of a surprise -- the ping brought the peer's
lladdr into the neighbor cache; and the subsequent address deletion first removed the address,
then all the typical local routes (the connected, the broadcast, the network, and the self/local);
but then as well explicitly deleted the neighbor, which I suppose is correct behavior for Linux,
were it not that VPP already invalidates the neighbor cache and adds/removes the connected routes
for example in `ip/ip4_forward.c` L826-L830 and L583.
I can see more of these false positive non-errors like the one on `lcp_nl_neigh_del()` because
interface and directly connected route addition/deletion is slightly different in VPP than in Linux.
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation into a friendly green notice instead.
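The shortcut amounts to a small wrapper around the add/del return value, something like this sketch (the error names and values here are illustrative, not VPP's actual vnet API codes):

```python
ALREADY_EXISTS = -1  # illustrative stand-in for VPP's "entry already exists"
NO_SUCH_ENTRY = -2   # illustrative stand-in for VPP's "no such entry"

def squash_benign_rv(rv, is_add):
    """Treat 'already there' on an addition and 'no such entry' on a
    deletion as success, so they log as a notice rather than an error."""
    if is_add and rv == ALREADY_EXISTS:
        return 0
    if not is_add and rv == NO_SUCH_ENTRY:
        return 0
    return rv
```
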
### Netlink: Link (existing)
There's a bunch of use cases for these messages `RTM_NEWLINK` and `RTM_DELLINK`. They carry information
about carrier (link, no-link), admin state (up/down), MTU, and so on. The function `lcp_nl_link_del()`
is the easier of the two. If I see a message like this for an ifindex that VPP has a _LIP_ for, I'll
just remove it. This means first calling the `lcp_itf_pair_delete()` function and then, if the message
was for a VLAN interface, removing the accompanying sub-interfaces: both the VPP one (eg.
`TenGigabitEthernet3/0/0.1234`) and the TAP that we used to communicate with the host (eg. `tap8.1234`).
The other message (the `RTM_NEWLINK` one), is much more complicated, because it's actually many types
of operation all in one message type: We can set the link up/down, change its MTU, and change its MAC
address, in any combination, perhaps like so:
```
ip link set e0 mtu 9216 address 00:11:22:33:44:55 down
```
So in turn, `lcp_nl_link_add()` will first look at admin state and apply it to the phy and tap,
apply the MTU if it's different to what VPP has, and apply the MAC address if it's different to
what VPP has, notably applying MAC addresses only to 'hardware' interfaces, which I now know are
not just physical ones like `TenGigabitEthernet3/0/0` but also virtual ones like `BondEthernet0`.
One thing I noticed, is that link state and MTU changes tend to go around in circles (from Netlink
into VPP, with this code, but when `lcp-sync` is on in the interface mirror plugin, changes to link
and mtu will trigger a callback there, which will in turn generate a Netlink message, and so on).
To avoid this loop, I temporarily turn off `lcp-sync` just before handling a batch of messages, and
turn it back to its original state when I'm done with that.
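That save/disable/restore dance can be sketched as follows (Python; the plugin does the equivalent in C around its batch handler, and the dict-based state here is purely illustrative):

```python
def with_lcp_sync_disabled(state, process_batch, batch):
    """Temporarily turn off lcp-sync while applying a batch of Netlink
    messages, then restore the operator's original setting, so that the
    applied changes don't echo back out as new Netlink messages."""
    saved = state["lcp-sync"]
    state["lcp-sync"] = False
    try:
        process_batch(batch)
    finally:
        state["lcp-sync"] = saved

state = {"lcp-sync": True}
applied = []
with_lcp_sync_disabled(state, lambda b: applied.extend(b),
                       ["mtu 9216", "link down"])
```
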
The code for add/del of existing links is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
### Netlink: Link (new)
Here's where it gets interesting! What if the `RTM_NEWLINK` message was for an interface that VPP
doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, then clearly the operator
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!
Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:
```
ip link add dum0 type dummy
ip link add link dum0 name dum0.10 type vlan id 10
```
Or perhaps more interestingly, the operator is actually trying to create a VLAN sub-interface on an
interface we created in VPP earlier, like these:
```
ip link add link e0 name e0.1234 type vlan id 1234
ip link add link e0.1234 name e0.1235 type vlan id 1000
ip link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad
ip link add link e0.1236 name e0.1237 type vlan id 1000
```
None of these `RTM_NEWLINK` messages, represented by a vif (Linux ifindex), will have a corresponding _LIP_.
So, I try to _create one_ by calling `lcp_nl_link_add_vlan()`.
First, I'll lookup the parent ifindex (`dum0` or `e0` in the examples above). The first example parent,
`dum0`, doesn't have a _LIP_, so I bail after logging a warning. The second example however, `e0`,
definitely does have a _LIP_, so it's known to VPP.
Now, I have two further choices:
1. the _LIP_ is a phy (ie `TenGigabitEthernet3/0/0` or `BondEthernet0`) and this is a regular tagged
interface with a given proto (dot1q or dot1ad); or
1. the _LIP_ is itself a subint (ie `TenGigabitEthernet3/0/0.1234`) and what I'm being asked for is
actually a QinQ or QinAD sub-interface. Remember, there's an important difference:
- In Linux these sub-interfaces are chained (`e0` creates child `e0.1234@e0` for a normal VLAN,
and `e0.1234` creates child `e0.1235@e0.1234` for the QinQ).
- In VPP these are actually all flat sub-interfaces, with the 'regular' VLAN interface carrying
the `one_tag` flag with only an `outer_vlan_id` set, and the latter QinQ carrying the `two_tags`
flag with both an `outer_vlan_id` (1234) and an `inner_vlan_id` (1000).
So I look up both the parent _LIP_ as well the phy _LIP_. I now have all the ingredients I need to create
the VPP sub-interfaces with the correct inner-dot1q and outer dot1q or dot1ad.
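The mapping from Linux's chained model onto VPP's flat one can be sketched like so (Python; the field names are illustrative, not VPP's actual sub-interface flags structure):

```python
def flatten_vlan(parent_tags, vlan_id, proto="dot1q"):
    """Compute the flat VPP sub-interface spec for a new Linux VLAN link.
    parent_tags is None when the parent is a phy (like e0), or
    (outer_proto, outer_vlan_id) when the parent is itself tagged."""
    if parent_tags is None:
        # Regular tagged interface directly on the phy: one tag.
        return {"flags": "one_tag", "outer_proto": proto,
                "outer_vlan_id": vlan_id}
    outer_proto, outer_id = parent_tags
    # QinQ / QinAD: both tags live on one flat VPP sub-interface.
    return {"flags": "two_tags", "outer_proto": outer_proto,
            "outer_vlan_id": outer_id, "inner_vlan_id": vlan_id}

# e0.1234 (dot1q 1234), then e0.1235 on top of it (inner dot1q 1000):
single = flatten_vlan(None, 1234)
qinq = flatten_vlan(("dot1q", 1234), 1000)
```
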
Of course, I don't really know what subinterface ID to use. It's appealing to "just" use the vlan id,
but that's not helpful if the outer tag and the inner tag are the same. So I write a helper function
`vnet_sw_interface_get_available_subid()` whose job it is to return an unused subid for the phy,
starting from 1.
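The helper's logic is a simple linear scan; a sketch of the idea behind `vnet_sw_interface_get_available_subid()` (the Python shape here is illustrative):

```python
def get_available_subid(used_subids):
    """Return the lowest sub-interface ID >= 1 that is not yet in use
    on the phy."""
    subid = 1
    while subid in used_subids:
        subid += 1
    return subid

# With subids 1..3 and 1234 taken, the next free one is 4:
next_free = get_available_subid({1, 2, 3, 1234})
```
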
Here as well, the interface plugin can be configured to automatically create _LIPs_ for sub-interfaces,
which I have to turn off temporarily to let my new form of creation do its thing. I carefully ensure that
the thread barrier is taken/released and the original setting of `lcp-auto-subint` is restored at all
exit points. One cool thing is that the new link's name is given in the Netlink message, so I can just
use that one. I like the aesthetic a bit more, because here the operator can give the Linux interface
any name they like, whereas in the other direction, VPP's `lcp-auto-subint` feature has to make up
a boring `<phy>.<subid>` name.
Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
## Results
The functional regression test I made on day one, the one that ensures end-to-end connectivity to and
from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged
and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`),
still works.
After this code is in, the operator will only have to create a _LIP_ for any phy interfaces, and
can rely on the new Netlink Listener plugin and the use of `ip` in Linux for all the rest. This
implementation starts approaching 'vanilla' Linux user experience!
Here's [a screencast](https://asciinema.org/a/432243) showing me playing around a bit, demonstrating
that synchronization works pretty well in both directions, a huge improvement from the
[previous screencast](https://asciinema.org/a/430411) in my [second post]({% post_url 2021-08-13-vpp-2 %}),
which was only two weeks ago:
<script id="asciicast-432243" src="https://asciinema.org/a/432243.js" async></script>
### Further Work
You will note that there's one important Netlink message type that's missing: routes! They are so
important in fact, that they're a topic of their very own post. Also, I haven't written the code
for them yet :-)
A few things worth noting, as future work.
**Multiple NetNS** - The original Netlink Listener ([ref](https://gerrit.fd.io/r/c/vpp/+/31122)) would
only listen to the default netns specified in the configuration file. This is problematic because the
interface plugin does allow interfaces to be made in other namespaces (by issuing
`lcp create ... host-if X netns foo`), the Netlink world of which will be unknown to VPP. I
created `struct lcp_nl_netlink_namespace` to hold the stuff needed for the Netlink listener,
which is a good starting point to create not one but multiple listeners, one for each unique
namespace that has one or more _LIPs_ defined. This is version-two work :)
**Multithreading** - In testing, I noticed that while my plugin itself is (or seems to be...) thread
safe, `virtio` may not be totally clean, and I noticed that in a multithreaded VPP instance with many
workers, there's a crash in `lcp_arp_phy_node()` where `vlib_buffer_copy()` returns NULL, which should
not happen. When VPP is in such a state, other plugins (notably DHCP and IPv6 ND) also start complaining,
and `show errors` shows millions of `virtio-input` errors about unavailable buffers.
I do confirm though, that running VPP single threaded does not have these issues.
## Credits
I'd like to make clear that the Linux CP plugin is a collaboration between several great minds,
and that my work stands on other software engineer's shoulders. In particular most of the Netlink
socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit
of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work!
## Appendix
#### Ubuntu config
This configuration has been the exact same ever since [my first post]({% post_url 2021-08-12-vpp-1 %}):
```
# Untagged interface
ip addr add 10.0.1.2/30 dev enp66s0f0
ip addr add 2001:db8:0:1::2/64 dev enp66s0f0
ip link set enp66s0f0 up mtu 9000
# Single 802.1q tag 1234
ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234
ip link set enp66s0f0.q up mtu 9000
ip addr add 10.0.2.2/30 dev enp66s0f0.q
ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000
ip link set enp66s0f0.qinq up mtu 9000
ip addr add 10.0.3.2/30 dev enp66s0f0.qinq
ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq
# Single 802.1ad tag 2345
ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad
ip link set enp66s0f0.ad up mtu 9000
ip addr add 10.0.4.2/30 dev enp66s0f0.ad
ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q
ip link set enp66s0f0.qinad up mtu 9000
ip addr add 10.0.5.2/30 dev enp66s0f0.qinad
ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad
## Bond interface
ip link add bond0 type bond mode 802.3ad
ip link set enp66s0f2 down
ip link set enp66s0f3 down
ip link set enp66s0f2 master bond0
ip link set enp66s0f3 master bond0
ip link set enp66s0f2 up
ip link set enp66s0f3 up
ip link set bond0 up
ip addr add 10.1.1.2/30 dev bond0
ip addr add 2001:db8:1:1::2/64 dev bond0
ip link set bond0 up mtu 9000
# Single 802.1q tag 1234
ip link add link bond0 name bond0.q type vlan id 1234
ip link set bond0.q up mtu 9000
ip addr add 10.1.2.2/30 dev bond0.q
ip addr add 2001:db8:1:2::2/64 dev bond0.q
# Double 802.1q tag 1234 inner-tag 1000
ip link add link bond0.q name bond0.qinq type vlan id 1000
ip link set bond0.qinq up mtu 9000
ip addr add 10.1.3.2/30 dev bond0.qinq
ip addr add 2001:db8:1:3::2/64 dev bond0.qinq
# Single 802.1ad tag 2345
ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad
ip link set bond0.ad up mtu 9000
ip addr add 10.1.4.2/30 dev bond0.ad
ip addr add 2001:db8:1:4::2/64 dev bond0.ad
# Double 802.1ad tag 2345 inner-tag 1000
ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q
ip link set bond0.qinad up mtu 9000
ip addr add 10.1.5.2/30 dev bond0.qinad
ip addr add 2001:db8:1:5::2/64 dev bond0.qinad
```
#### VPP config
We can whittle down the VPP configuration to the bare minimum:
```
vppctl lcp default netns dataplane
vppctl lcp lcp-sync on
vppctl lcp lcp-auto-subint on
## Create `e0`
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0
## Create `be0`
vppctl create bond mode lacp load-balance l34
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3
vppctl set interface state TenGigabitEthernet3/0/2 up
vppctl set interface state TenGigabitEthernet3/0/3 up
vppctl lcp create BondEthernet0 host-if be0
```
And the rest of the configuration work is done entirely from the Linux side!
```
IP="sudo ip netns exec dataplane ip"
## `e0` aka TenGigabitEthernet3/0/0
$IP link add link e0 name e0.1234 type vlan id 1234
$IP link add link e0.1234 name e0.1235 type vlan id 1000
$IP link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad
$IP link add link e0.1236 name e0.1237 type vlan id 1000
$IP link set e0 up mtu 9000
$IP addr add 10.0.1.1/30 dev e0
$IP addr add 2001:db8:0:1::1/64 dev e0
$IP addr add 10.0.2.1/30 dev e0.1234
$IP addr add 2001:db8:0:2::1/64 dev e0.1234
$IP addr add 10.0.3.1/30 dev e0.1235
$IP addr add 2001:db8:0:3::1/64 dev e0.1235
$IP addr add 10.0.4.1/30 dev e0.1236
$IP addr add 2001:db8:0:4::1/64 dev e0.1236
$IP addr add 10.0.5.1/30 dev e0.1237
$IP addr add 2001:db8:0:5::1/64 dev e0.1237
## `be0` aka BondEthernet0
$IP link add link be0 name be0.1234 type vlan id 1234
$IP link add link be0.1234 name be0.1235 type vlan id 1000
$IP link add link be0 name be0.1236 type vlan id 2345 proto 802.1ad
$IP link add link be0.1236 name be0.1237 type vlan id 1000
$IP link set be0 up mtu 9000
$IP addr add 10.1.1.1/30 dev be0
$IP addr add 2001:db8:1:1::1/64 dev be0
$IP addr add 10.1.2.1/30 dev be0.1234
$IP addr add 2001:db8:1:2::1/64 dev be0.1234
$IP addr add 10.1.3.1/30 dev be0.1235
$IP addr add 2001:db8:1:3::1/64 dev be0.1235
$IP addr add 10.1.4.1/30 dev be0.1236
$IP addr add 2001:db8:1:4::1/64 dev be0.1236
$IP addr add 10.1.5.1/30 dev be0.1237
$IP addr add 2001:db8:1:5::1/64 dev be0.1237
```
#### Final note
You may have noticed that the [commit] links all point to git commits in my private working copy. I
want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
meantime :-)
---
date: "2021-08-26T12:55:44Z"
title: Fiber7-X in 1790BRE
---
## Introduction
I've been a very happy Init7 customer since 2016, when the fiber to the home
ISP I was a subscriber at back then, a small company called Easyzone, got
acquired by Init7. The technical situation in Wangen-Br&uuml;ttisellen was
a bit different back in 2016. There was a switch provided by Litecom in which
ports were resold OEM to upstream ISPs, and Litecom would provide the L2
backhaul to a central place to hand off the customers to the ISPs, in my case
Easyzone. In Oct'16, Fredy asked me if I could do a test of
Fiber7-on-Litecom, which I did and reported on in a [blog post]({% post_url 2016-10-07-fiber7-litexchange %}).
Some time early 2017, Init7 deployed a POP in Dietlikon (790BRE) and then
magically another one in Br&uuml;ttisellen (1790BRE). It's a funny story
why the Dietlikon point of presence is called 790BRE, but I'll leave that
for the bar, not this post :-)
## Fiber7's Next Gen
Some of us read a rather curious tweet back in May:
{{< image width="400px" float="center" src="/assets/fiber7-x/tweet.png" alt="Tweet-X2" >}}
Translated -- _'7 years ago our Gigabit-Internet was born. To celebrate this day,
here's a riddle for #Nerds: Gordon Moore's law says dictates doubling every 18 months.
What does that mean for our 7 year old Fiber7?'_ Well, 7 years is 84 months, and
doubling every 18 months means 84/18 = 4.6667 doublings, and 1Gbps * 2^4.6667 =
25.4Gbps. Holy shitballs, Init7 just announced that their new platform will offer 25G
symmetric ethernet?!
"I wonder what that will cost?", I remember myself thinking. "**The same price**",
was the answer. I can see why -- monitoring my own family's use, we're doing a good
60Mbit or so when we stream Netflix and/or Spotify (which we all do daily). And some
IPTV maybe at 4k will go for a few hundred megs, but the only time we actually use
the gigabit is when we do a speedtest or an iperf :-) Moreover, offering 25G fits
the company's marketing strategy well, because our larger Swiss national telco and
cable providers are all muddying the waters with their DOCSIS and GPON offerings,
both of which _can_ do 10Gbit, but these are TDM (time division multiplexing) systems
in which any number of subscribers share that bandwidth to a central office. And
when I say any number, it's easy to imagine 128 or 256 subscribers on one XGSPON,
and many of those transponders in a telco line terminator, each with redundant uplinks
of 2x10G or sometimes 2x40G. But that's an oversubscription of easily 2000x, taking
128 (subscribers per PON) x16 (PONs per linecard) x8 (linecards), is 16K subscribers
of 10G using 80G (or only 20G) of uplink bandwidth. That's massively inferior from
a technical perspective. And, as we'll see below, it doesn't really allow for
advanced services, like L2 backhaul from the subscriber to a central office.
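To put a number on that worst case, here's the same oversubscription arithmetic from the paragraph above:

```python
# Worst-case XGSPON oversubscription, using the numbers from the text:
# 128 subscribers per PON, 16 PONs per linecard, 8 linecards, 10G per sub.
subscribers = 128 * 16 * 8           # 16384 subscribers
sold_gbps = subscribers * 10         # 163840 Gbit/s of sold capacity
uplink_gbps = 2 * 40                 # 2x40G uplinks (use 2*10 for the 20G case)
print(f"{sold_gbps / uplink_gbps:.0f}x oversubscription")  # -> 2048x
```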
Now to be fair, the 1790BRE pop that I am personally connected to has 2x 10G uplinks
and ~200 or so 1G downlinks, which is also a local overbooking of 10:1, or 20:1 if only
one of the uplinks is used at any given time. Worth noting, sometimes several cities
are daisy chained, which makes for larger overbooking if you're deep in the Fiber7
access network. I am pretty close (790BRE-790SCW-790OER-Core; and an alternate path of
780EFF-Core; only one of which is used because the Fiber7 edge switches use OSPF and
a limited TCAM space means only few if any public routes are there; I assume a default
is injected into OSPF at every core site and limited traffic engineering is done).
The longer the traceroute, the cooler it looks, but the more customers are ahead of
you, causing more overbooking. YMMV ;-)
## Upgrading 1790BRE
{{< image width="300px" float="right" src="/assets/fiber7-x/before.png" alt="Before" >}}
Wouldn't it be cool if Init7 upgraded to 100G intra-pop? Well, this is the story
of their Access'21 project! My buddy Pascal (who is now the CTO at Init7, good
choice!), explained it to me in a business call back in June, but also shared it
in a [presentation](/assets/fiber7-x/UKNOF_20210803.pdf) which I definitely encourage
you to browse through. If you thought I was jaded on GPON, check out their assessment,
it's totally next level!
Anyway, the new POPs are based on Cisco's C9500 switches, which come in two variants:
Access switches are C9500-48Y4C which take 48x SPF28 (1/10/25Gbit) and 4x QSFP+ (40/100Gbit)
and aggregation switches are C9500-32C which take 32x QSFP+ (40/100Gbit).
As subscribers, we all got a courtesy heads-up on the date of 1790BRE's upgrade.
It was [scheduled](https://as13030.net/status/?ticket=4238550) for Thursday Aug 26th
starting at midnight. As I've written about before (for example at the bottom of my
[Bucketlist post]({% post_url 2021-07-26-bucketlist %})), I really enjoy the immediate
gratification of physical labor in a datacenter. Most of my projects at work are on the
quarters-to-years timeframe, and being able to do a thing and see the result of that
thing ~immediately is a huge boost for me.
So I offered to let one of the two Init7 people take the night off and help perform the
upgrade myself. The picture on the right shows how the switch looked until now, with
four linecards of 48x1G trunked into 2x10G uplinks, one towards Effretikon and one
towards Dietlikon. It's an aging Cisco 4510 switch (they were released around 2010),
but it has served us well here in Br&uuml;ttisellen for many years, thank you, little
chassis!
## The Upgrade
{{< image width="300px" float="right" src="/assets/fiber7-x/during.png" alt="During" >}}
I met the Init7 engineer in front of the Werke Wangen-Br&uuml;ttisellen, which is about
170m from my house, as the photons fly, at around 23:30. We chatted for a little while,
I had already gotten to know him due to mutual hosting at NTT in R&uuml;mlang, so of
course our basement ISPs peer over [CommunityIX](https://communityix.ch/) and so on,
but it's cool to put a face to the name.
The new switches had already been racked by Pascal, DWDM multiplexers have appeared,
and what used to be a simplex fiber is now two pairs of duplex fibers.
Maybe DWDM services are in reach for me at some point? I should look into that ... but
for now let's focus on the task at hand.
In the picture on the right, you can see from top to bottom: DWDM mux to ZH11/790ZHB
which immediately struck my eye as clever - it's an 8-channel DWDM mux with channels C31-C38
and two wideband passthroughs, one is 1310W which means "a wideband 1310nm" which is where
the 100G optics are sending; and the other is UPG which is an upgrade port, allowing to add
more DWDM channels in a separate mux into the fiber at a later date, at the expense of
2dB or so of insertion loss. Nice. The second is an identical unit, a DWDM mux to 780EFF
which has again one 100G 1310nm wideband channel towards Effretikon and then on to
Winterthur, and C31 in use by the original C4510 switch (that link used to
be a dark fiber with vanilla 10G optics connecting 1790BRE with 780EFF).
Then there are two redundant aggregation switches (the 32x100G kind), which have each
four access switches connected to them, with the pink cables. Those are interesting:
to make 100G very cheap, optics can make use of 4x25G lasers that each take one fiber,
so 8 fibers in total, and those pink cables are 12-fiber multimode trunks with an
[MPO](https://vitextech.com/mpo-mtp-connectors-difference/) connector. The optics for this
type of connection are super cheap, for example this [Flexoptix](https://www.flexoptix.net/en/qsfp28-sr4-transceiver-100-gigabit-mm-850nm-100m-1db-ddm-dom.html?co8502=76636) one. I have the 40G variant at home, also running multimode
4x10G MPO cables, at a fraction of the price of singlemode single-laser variants. So
when people say "multimode is useless, always use singlemode", point them at this post
please!
{{< image width="300px" float="left" src="/assets/fiber7-x/after.png" alt="After" >}}
There were 11 subscribers who upgraded their service, ten of them to 10Gbps (myself
included) and one of them to 25Gbps, lucky bastard. So in a first pass we shut down all
the ports on the C4510 and moved over optics and fibers one by one into the new C9500
switches, of which there were four.
Werke Wangen-Br&uuml;ttisellen (the local telcoroom owners in my town) has historically done
a great job of labeling every fiber with little numbered clips, so it's easy to ensure
that what used to be fiber #33, is now still in port #33. I worked from the right,
taking two optics from the old switch, moving them into the new switch, and reinserting
the fibers. The Init7 engineer worked from the left, doing the same. We managed to
complete this swap-over in record time, according to Pascal, who was monitoring
remotely and reconfiguring the switches to put the subscribers back into service. We
started at 00:05 and completed the physical reconfiguration at 01:21am. Go, us!
After the physical work, we conducted an Init7 post-maintenance ritual which was eating
a cheeseburger to replenish our body's salt and fat contents. We did that at my place
and luckily I have access to a microwave oven and also some Blairs Mega Death hotsauce
(with liquid rage) which my buddy enthusiastically drizzled onto the burger, but it did
make him burp just a little bit as sweat poured out of his face. That was fun! I took
some more pictures, published with permission, in [this album](https://photos.app.goo.gl/VozxYvnuXSQPBePG7).
<hr />
### 10G VLL
One more thing! I had waited to order this until the time was right, and the upgrade of
1790BRE was it -- since I operate AS50869, a little basement ISP, I had always hoped to
change my 1500 byte MTU L3 service into a Jumboframe capable L2 service. After some
negotiation on the contractuals, I signed an order ahead of this maintenance to upgrade
to a 10G virtual leased line (VLL) from this place to the NTT datacenter in R&uuml;mlang.
In the afternoon, I had already patched my side of the link in the datacenter, and I
noticed that the Init7 side of the patch was dangling in their rack without an optic. So
we went to the datacenter (at 2am, the drive from my house to NTT is 9 minutes, without
speeding!), and plugged in an optic to let my lonely photons hit a friendly receiver.
I then got to configure the VLL together with my buddy, which was a highlight of the night
for me. I now have access to a spiffy new 10 gigabit VLL operating at 9190 MTU, from
1790BRE directly to my router `chrma0.ipng.ch` at NTT R&uuml;mlang, while previously I
had secured a 1G carrier ethernet operating at 9000 MTU directly to my router
`chgtg0.ipng.ch` at Interxion Glattbrugg. Between the two sites, I have a CWDM wave
which currently runs 10G optics but I have the 25G CWDM optics and switches ready for
deployment. It's somewhat (ok, utterly) over the top, but I like (ok, love) it.
```
pim@chbtl0:~$ show protocols ospfv3 neighbor
Neighbor ID Pri DeadTime State/IfState Duration I/F[State]
194.1.163.4 1 00:00:38 Full/PointToPoint 87d05:37:45 dp0p6s0f3[PointToPoint]
194.1.163.86 1 00:00:31 Full/DROther 16:18:39 dp0p6s0f2.101[BDR]
194.1.163.87 1 00:00:30 Full/DR 7d15:48:41 dp0p6s0f2.101[BDR]
194.1.163.0 1 00:00:38 Full/PointToPoint 2d12:02:19 dp0p6s0f0[PointToPoint]
```
The latency from my workstation on which I'm writing this blogpost to, say, my Bucketlist
location of NIKHEF in the Amsterdam Watergraafsmeer, is pretty much as fast as light
goes (I've seen 12.2ms, but considering it's ~820km, this is not bad at all):
```
pim@chumbucket:~$ traceroute gripe
traceroute to gripe (94.142.241.186), 30 hops max, 60 byte packets
1 chbtl0.ipng.ch (194.1.163.66) 0.211 ms 0.186 ms 0.189 ms
2 chrma0.ipng.ch (194.1.163.17) 1.463 ms 1.416 ms 1.432 ms
3 defra0.ipng.ch (194.1.163.25) 7.376 ms 7.344 ms 7.330 ms
4 nlams0.ipng.ch (194.1.163.27) 12.952 ms 13.115 ms 12.925 ms
5 gripe.ipng.nl (94.142.241.186) 13.250 ms 13.337 ms 13.223 ms
```
And, due to the work we did above, now the bandwidth is up to par as well, with
comparable down- and upload speeds of 9.2Gbit from NL&gt;CH and 8.9Gbit from
CH&gt;NL, and, while I'm not going to prove it here, this would work equally
well with 9000 byte, 1500 byte or 64 byte frames due to my use of DPDK based
routers, which just don't G.A.F.:
```
pim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -R -P 10 ## Richtung Schweiz!
Connecting to host nlams0, port 5201
Reverse mode, remote host nlams0 is sending
...
[SUM] 0.00-10.01 sec 10.8 GBytes 9.26 Gbits/sec 53 sender
[SUM] 0.00-10.00 sec 10.7 GBytes 9.19 Gbits/sec receiver
pim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -P 10 ## Naar Nederland toe!
Connecting to host nlams0, port 5201
...
[SUM] 0.00-10.00 sec 9.93 GBytes 8.87 Gbits/sec 405 sender
[SUM] 0.00-10.02 sec 9.91 GBytes 8.84 Gbits/sec receiver
```
---
date: "2021-09-02T12:19:14Z"
title: VPP Linux CP - Part5
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
In the previous post, I added support for VPP to consume Netlink messages that describe interfaces,
IP addresses and ARP/ND neighbor changes. This post completes the tablestakes Netlink handler by
adding IPv4 and IPv6 route messages, and ends up with a router in the DFZ consuming 133K IPv6
prefixes and 870K IPv4 prefixes.
## My test setup
The goal of this post is to show what code needed to be written to extend the **Netlink Listener**
plugin I wrote in the [fourth post]({% post_url 2021-08-25-vpp-4 %}), so that it can consume
route additions/deletions, a thing that is common in dynamic routing protocols such as OSPF and
BGP.
The setup from my [third post]({% post_url 2021-08-15-vpp-3 %}) is still there, but it's no longer
a focal point for me. I use it (the regular interface + subints and the BondEthernet + subints)
just to ensure my new code doesn't have a regression.
Instead, I'm creating two VLAN interfaces now:
- The first is in my home network's _servers_ VLAN. There are three OSPF speakers there:
- `chbtl0.ipng.ch` and `chbtl1.ipng.ch` are my main routers, they run DANOS and are in
the Default Free Zone (or DFZ for short).
- `rr0.chbtl0.ipng.ch` is one of AS50869's three route-reflectors. Every one of the 13
routers in AS50869 exchanges BGP information with these, and it cuts down on the total
    number of iBGP sessions I have to maintain -- see [here](https://networklessons.com/bgp/bgp-route-reflector)
for details on Route Reflectors.
- The second is an L2 connection to a local BGP exchange, with only three members (IPng Networks
  AS50869, Openfactory AS58299, and Stucchinet AS58280). In this VLAN, Openfactory was so kind
as to configure a full transit session for me, and I'll use it in my test bench.
The test setup offers me the ability to consume OSPF, OSPFv3 and BGP.
### Starting point
Based on the state of the plugin after the [fourth post]({% post_url 2021-08-25-vpp-4 %}),
operators can create VLANs (including .1q, .1ad, QinQ and QinAD subinterfaces) directly in
Linux. They can change link attributes (like set admin state 'up' or 'down', or change
the MTU on a link), they can add/remove IP addresses, and the system will add/remove IPv4
and IPv6 neighbors. But notably, the following Netlink messages are not yet consumed, as shown
by the following example:
```
pim@hippo:~/src/lcpng$ sudo ip link add link e1 name servers type vlan id 101
pim@hippo:~/src/lcpng$ sudo ip link set servers up mtu 1500
pim@hippo:~/src/lcpng$ sudo ip addr add 194.1.163.86/27 dev servers
pim@hippo:~/src/lcpng$ sudo ip ro add default via 194.1.163.65
```
which does the first three commands just fine, but the fourth:
```
linux-cp/nl [debug ]: dispatch: ignored route/route: add family inet type 1 proto 3
table 254 dst 0.0.0.0/0 nexthops { gateway 194.1.163.65 idx 197 }
```
In this post, I'll implement that last missing piece in two functions called `lcp_nl_route_add()`
and `lcp_nl_route_del()`. Here we go!
## Netlink Routes
Reusing the approach from the work-in-progress [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)], I introduce two FIB sources: one
for manual routes (ie. the ones that an operator might set with `ip route add`), and another one
for dynamic routes (ie. what a routing protocol like Bird or FRR might set), this is in
`lcp_nl_proto_fib_source()`. Next, I need a bunch of helper functions that can translate the
Netlink message information into VPP primitives:
- `lcp_nl_mk_addr46()` converts a Netlink `nl_addr` to a VPP `ip46_address_t`.
- `lcp_nl_mk_route_prefix()` converts a Netlink `rtnl_route` to a VPP `fib_prefix_t`.
- `lcp_nl_mk_route_mprefix()` converts a Netlink `rtnl_route` to a VPP `mfib_prefix_t` (for
multicast routes).
- `lcp_nl_mk_route_entry_flags()` generates `fib_entry_flag_t` from the Netlink route type,
table and proto metadata.
- `lcp_nl_proto_fib_source()` selects the most appropriate FIB source by looking at the
`rt_proto` field from the Netlink message (see `/etc/iproute2/rt_protos` for a list of
these). Anything **RTPROT_STATIC** or better is `fib_src`, while anything above that
becomes `fib_src_dynamic`.
- `lcp_nl_route_path_parse()` converts a Netlink `rtnl_nexthop` to a VPP `fib_route_path_t`
  and adds that to a growing list of paths. Just as Netlink's nexthops form a list, so
  do the individual paths in VPP, so that lines up perfectly.
- `lcp_nl_route_path_add_special()` adds a blackhole/unreach/prohibit route to the list
of paths, in the special-case there is not yet a path for the destination.
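The selection rule in `lcp_nl_proto_fib_source()` can be sketched in a few lines of Python (a sketch of the logic only; the plugin itself is C, and the constant is from `/etc/iproute2/rt_protos`):

```python
# Route protocols <= RTPROT_STATIC (4) are 'manual' routes (redirect, kernel,
# boot, static); anything above that came from a routing daemon such as Bird
# (proto 12) and is tracked with the dynamic FIB source instead.
RTPROT_STATIC = 4

def fib_source(rt_proto: int) -> str:
    return "fib_src" if rt_proto <= RTPROT_STATIC else "fib_src_dynamic"

print(fib_source(3))    # 'boot', e.g. 'ip route add'  -> fib_src
print(fib_source(12))   # 'bird'                       -> fib_src_dynamic
```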
With these helpers, I will have enough to manipulate VPP's forwarding information base or _FIB_
for short. But in VPP, the _FIB_ consists of any number of _tables_ (think of them as _VRFs_
or Virtual Routing/Forwarding domains). So first, I need to add these:
- `lcp_nl_table_find()` selects the matching `{table-id,protocol}` (v4/v6) tuple from
an internally kept hash of tables.
- `lcp_nl_table_add_or_lock()` if a table with key `{table-id,protocol}` (v4/v6) hasn't
been used yet, create one in VPP, and store it for future reference. Otherwise increment
a table reference counter so I know how many FIB entries VPP will have in this table.
- `lcp_nl_table_unlock()` given a table, decrease the refcount on it, and if no more
prefixes are in the table, remove it from VPP.
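The refcounting behaviour of those three helpers can be illustrated with a small Python sketch (hypothetical names mirroring the description above, not the plugin's actual C):

```python
# {(table-id, protocol): refcount} -- mirrors the internally kept hash.
tables = {}

def table_add_or_lock(table_id, proto):
    key = (table_id, proto)
    if key not in tables:
        tables[key] = 0      # first user: here VPP would create the FIB table
    tables[key] += 1

def table_unlock(table_id, proto):
    key = (table_id, proto)
    tables[key] -= 1
    if tables[key] == 0:     # last prefix removed: drop the table from VPP
        del tables[key]

table_add_or_lock(254, "ip4")    # first route in table 254 creates it
table_add_or_lock(254, "ip4")    # second route only bumps the refcount
table_unlock(254, "ip4")
print(tables)                    # {(254, 'ip4'): 1}
```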
All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
### Deletion
Our main function `lcp_nl_route_del()` will remove a route from the given table-id/protocol.
I do this by applying `rtnl_route_foreach_nexthop()` callbacks to the list of Netlink message
nexthops, converting each of them into VPP paths in a `lcp_nl_route_path_parse_t` structure.
If the route is for unreachable/blackhole/prohibit in Linux, add that path too.
Then, remove the VPP paths from the FIB and reduce refcnt or remove the table if it's empty.
This is reasonably straightforward.
### Addition
Adding routes to the FIB is done with `lcp_nl_route_add()`. It immediately becomes obvious
that not all routes are relevant for VPP. A prime example are those in table 255, they are
'local' routes, which have already been set up by IPv4 and IPv6 address addition functions
in VPP. There are some other route types that are invalid, so I'll just skip those.
Link-local IPv6 and IPv6 multicast is also skipped, because they're also added when interfaces
get their IP addresses configured. But for the other routes, similar to deletion, I'll extract
the paths from the Netlink message's nexthops list, by constructing an `lcp_nl_route_path_parse_t`
by walking those Netlink nexthops, and optionally add a _special_ route (in case the route was
for unreachable/blackhole/prohibit in Linux -- those won't have a nexthop).
Then, insert the VPP paths found in the Netlink message into the FIB or the multicast FIB,
respectively.
## Control Plane: Bird
So with this newly added code, the example above of setting a default route shoots to life.
But I can do better! At IPng Networks, my routing suite of choice is Bird2, and I have some
code to generate configurations for it and push those configs safely to routers. So, let's
take a closer look at a configuration on the test machine running VPP + Linux CP with this
new Netlink route handler.
```
router id 194.1.163.86;
protocol device { scan time 10; }
protocol direct { ipv4; ipv6; check link yes; }
```
These first two protocols are internal implementation details. The first, called _device_
periodically scans the network interface list in Linux, to pick up new interfaces. You can
compare it to issuing `ip link` and acting on additions/removals as they occur. The second,
called _direct_, generates directly connected routes for interfaces that have IPv4 or IPv6
addresses configured. It turns out that if I add `194.1.163.86/27` as an IPv4 address on
an interface, it'll generate several Netlink messages: one for the `RTM_NEWADDR` which
I discussed in my [fourth post]({% post_url 2021-08-25-vpp-4 %}), and also a `RTM_NEWROUTE`
for the connected `194.1.163.64/27` in this case. It helps the kernel understand that if
we want to send a packet to a host in that prefix, we should not send it to the default
gateway, but rather to a nexthop on the device. Those are interchangeably called `direct`
or `connected` routes. Ironically, these are called `RTS_DEVICE` routes in Bird2
[ref](https://github.com/BIRD/bird/blob/master/nest/route.h#L373) even though they are
generated by the `direct` routing protocol.
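The connected prefix is simply the address with its host bits masked off, which is easy to verify with Python's standard `ipaddress` module:

```python
import ipaddress

# Adding 194.1.163.86/27 to an interface implies a connected route for the
# enclosing /27 -- the kernel derives it by masking off the host bits.
ifaddr = ipaddress.ip_interface("194.1.163.86/27")
print(ifaddr.network)    # 194.1.163.64/27
```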
That brings me to the third protocol, one for each address type:
```
protocol kernel kernel4 {
ipv4 {
import all;
export where source != RTS_DEVICE;
};
}
protocol kernel kernel6 {
ipv6 {
import all;
export where source != RTS_DEVICE;
};
}
```
We're asking Bird to import any route it learns from the kernel, and we're asking it to
export any route that's not an `RTS_DEVICE` route. The reason for this is that when we
create IPv4/IPv6 addresses, the `ip` command already adds the connected route, and this
keeps Bird from inserting a second, identical route for those connected routes. And with
that, I have a very simple view, given for example these two interfaces:
```
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip route
45.129.224.232/29 dev ixp proto kernel scope link src 45.129.224.235
194.1.163.64/27 dev servers proto kernel scope link src 194.1.163.86
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 route
2a0e:5040:0:2::/64 dev ixp proto kernel metric 256 pref medium
2001:678:d78:3::/64 dev servers proto kernel metric 256 pref medium
pim@hippo:/etc/bird$ birdc show route
BIRD 2.0.7 ready.
Table master4:
45.129.224.232/29 unicast [direct1 20:48:55.547] * (240)
dev ixp
194.1.163.64/27 unicast [direct1 20:48:55.547] * (240)
dev servers
Table master6:
2a0e:5040:1001::/64 unicast [direct1 20:48:55.547] * (240)
dev stucchi
2001:678:d78:3::/64 unicast [direct1 20:48:55.547] * (240)
dev servers
```
## Control Plane: OSPF
Considering the `servers` network above has a few OSPF speakers in it, I will introduce this
router there as well. The configuration is very straight forward in Bird, let's just add
the OSPF and OSPFv3 protocols as follows:
```
protocol ospf v2 ospf4 {
ipv4 { export where source = RTS_DEVICE; import all; };
area 0 {
interface "lo" { stub yes; };
interface "servers" { type broadcast; cost 5; };
};
}
protocol ospf v3 ospf6 {
ipv6 { export where source = RTS_DEVICE; import all; };
area 0 {
interface "lo" { stub yes; };
interface "servers" { type broadcast; cost 5; };
};
}
```
Here, I tell OSPF to export all `connected` routes, and accept any route given to it. The only
difference between IPv4 and IPv6 is that the former uses OSPF version 2 of the protocol, and IPv6
uses version 3 of the protocol. And, as with the `kernel` routing protocol above, each instance
has to have its own unique name, so I make the obvious choice.
Within a few seconds, the OSPF Hello packets can be seen going out of the `servers` interface,
and adjacencies form shortly thereafter:
```
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l
83
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l
74
pim@hippo:~/src/lcpng$ birdc show ospf nei ospf4
BIRD 2.0.7 ready.
ospf4:
Router ID Pri State DTime Interface Router IP
194.1.163.3 1 Full/Other 39.588 servers 194.1.163.66
194.1.163.87 1 Full/DR 39.588 servers 194.1.163.87
194.1.163.4 1 Full/Other 39.588 servers 194.1.163.67
pim@hippo:~/src/lcpng$ birdc show ospf nei ospf6
BIRD 2.0.7 ready.
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.87 1 Full/DR 32.221 servers fe80::5054:ff:feaa:2b24
194.1.163.3 1 Full/BDR 39.504 servers fe80::9e69:b4ff:fe61:7679
194.1.163.4 1 2-Way/Other 38.357 servers fe80::9e69:b4ff:fe61:a1dd
```
And all of these were inserted into the VPP forwarding information base, taking for example
the IPng router in Amsterdam, loopback address `194.1.163.32` and `2001:678:d78::8`:
```
DBGvpp# show ip fib 194.1.163.32
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, lcp-rt:1, nat-hi:2, ]
194.1.163.32/32 fib:0 index:70 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[49] locks:142 flags:shared,popular, uPRF-list:49 len:1 itfs:[16, ]
path:[69] pl-index:49 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
194.1.163.67 TenGigabitEthernet3/0/1.3
[@0]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:72 buckets:1 uRPF:49 to:[0:0]]
[0] [@5]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800
DBGvpp# show ip6 fib 2001:678:d78::8
ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
2001:678:d78::8/128 fib:0 index:130058 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[116] locks:220 flags:shared,popular, uPRF-list:106 len:1 itfs:[16, ]
path:[141] pl-index:116 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3
[@0]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd
forwarding: unicast-ip6-chain
[@0]: dpo-load-balance: [proto:ip6 index:130060 buckets:1 uRPF:106 to:[0:0]]
[0] [@5]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd
```
In the snippet above we can see elements of the Linux CP Netlink Listener plugin doing its work.
It found the right nexthop, the right interface, enabled the FIB entry, and marked it with the
correct _FIB_ source `lcp-rt-dynamic`. And, with OSPF and OSPFv3 now enabled, VPP has gained visibility
to all of my internal network:
```
pim@hippo:~/src/lcpng$ traceroute nlams0.ipng.ch
traceroute to nlams0.ipng.ch (2001:678:d78::8) from 2001:678:d78:3::86, 30 hops max, 24 byte packets
1 chbtl1.ipng.ch (2001:678:d78:3::1) 0.3182 ms 0.2840 ms 0.1841 ms
2 chgtg0.ipng.ch (2001:678:d78::2:4:2) 0.5473 ms 0.6996 ms 0.6836 ms
3 chrma0.ipng.ch (2001:678:d78::2:0:1) 0.7700 ms 0.7693 ms 0.7692 ms
4 defra0.ipng.ch (2001:678:d78::7) 6.6586 ms 6.6443 ms 6.9292 ms
5 nlams0.ipng.ch (2001:678:d78::8) 12.8321 ms 12.9398 ms 12.6225 ms
```
## Control Plane: BGP
But the holy grail, and what got me started on this whole adventure, is to be able to participate in the
_Default Free Zone_ using BGP. So let's put these plugins to the test and load up a so-called _full table_,
which means: all the routing information needed to reach any part of the internet. As of August'21,
there are about 870'000 such prefixes for IPv4, and about 133'000 prefixes for IPv6. We passed the magic
1M number, which I'm sure makes some silicon vendors anxious, because lots of older kit in the field won't
scale beyond a certain size. VPP is totally immune to this problem, so here we go!
```
template bgp T_IBGP4 {
local as 50869;
neighbor as 50869;
source address 194.1.163.86;
ipv4 { import all; export none; next hop self on; };
};
protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; }
protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; }
protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; }
template bgp T_IBGP6 {
local as 50869;
neighbor as 50869;
source address 2001:678:d78:3::86;
ipv6 { import all; export none; next hop self ibgp; };
};
protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; }
protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; }
protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; }
```
And with these two blocks, I've added six new protocols -- three of them are IPv4 route-reflector
clients, and three of them are IPv6 ones. Once this commits, Bird will be able to find these IP
addresses due to the OSPF routes being loaded into the _FIB_, and once it does that, each of the
route-reflector servers will download a full routing table into Bird's memory, and in turn Bird
will use the `kernel4` and `kernel6` protocol to export them into Linux (essentially performing
an `ip ro add ... via ...` on each), and the kernel will then generate a Netlink message, which
the Linux CP **Netlink Listener** plugin will pick up and the rest, as they say, is history.
I gotta tell you - the first time I saw this working end to end, I was elated. Just seeing blocks
of 6800-7000 of these being pumped into VPP's _FIB_ each 40ms block was just .. magical. And the
performance is pretty good, too: 7000 messages per 40ms is 175K/sec, which means a VPP operator
can not only consume but also program a full IPv4 and IPv6 table into the _FIB_ in about 6
seconds, whoa!
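The arithmetic behind that claim, using the observed batch sizes and the table counts mentioned earlier:

```python
# FIB programming throughput, from the observed ~7000 messages per 40 ms batch.
msgs_per_batch = 7000
batch_secs = 0.040
rate = msgs_per_batch / batch_secs            # 175,000 routes/sec
full_table = 870_000 + 133_000                # IPv4 + IPv6 prefixes, Aug'21
print(f"{rate:,.0f} msgs/s -> full table in {full_table / rate:.1f} s")
```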
```
DBGvpp#
linux-cp/nl [warn ]: process_msgs: Processed 6550 messages in 40001 usecs, 2607 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6368 messages in 40000 usecs, 7012 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6460 messages in 40001 usecs, 13163 left in queue
...
linux-cp/nl [warn ]: process_msgs: Processed 6418 messages in 40004 usecs, 93606 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6438 messages in 40002 usecs, 96944 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6575 messages in 40002 usecs, 99986 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6552 messages in 40004 usecs, 94767 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 5890 messages in 40001 usecs, 88877 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6829 messages in 40003 usecs, 82048 left in queue
...
linux-cp/nl [warn ]: process_msgs: Processed 6685 messages in 40004 usecs, 13576 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6701 messages in 40003 usecs, 6893 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6579 messages in 40003 usecs, 314 left in queue
DBGvpp#
```
Due to a cooperative multitasking approach, the Netlink message queue producer continuously
reads Netlink messages from the kernel and puts them in a queue, while the consumer handles at
most 8000 messages or 40ms worth of work, whichever comes first, after which it yields control
back to VPP. So you can see here that when the
kernel is flooding the Netlink messages of the learned BGP routing table, the plugin correctly consumes
what it can, the queue grows (in this case to just about 100K messages) and then quickly shrinks again.
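That consume-at-most-N-messages-or-T-milliseconds pattern looks roughly like this (a Python sketch of the control flow only; the real code is a C process node inside VPP):

```python
import time
from collections import deque

BATCH_MSGS = 8000          # yield after this many messages ...
BATCH_SECS = 0.040         # ... or after 40 ms, whichever comes first

def process_batch(queue):
    deadline = time.monotonic() + BATCH_SECS
    done = 0
    while queue and done < BATCH_MSGS and time.monotonic() < deadline:
        queue.popleft()    # handle one Netlink message here
        done += 1
    return done            # after this, control yields back to VPP

queue = deque(range(100_000))    # e.g. a full-table flood from the kernel
while queue:
    n = process_batch(queue)
    print(f"Processed {n} messages, {len(queue)} left in queue")
```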
And indeed, Bird, IP and VPP all seem to agree, we did a good job:
```
pim@hippo:~/src/lcpng$ birdc show route count
BIRD 2.0.7 ready.
1741035 of 1741035 routes for 870479 networks in table master4
396518 of 396518 routes for 132479 networks in table master6
Total: 2137553 of 2137553 routes for 1002958 networks in 2 tables
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l
132430
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l
870494
pim@hippo:~/src/lcpng$ vppctl sh ip6 fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
132479
pim@hippo:~/src/lcpng$ vppctl sh ip fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
870529
```
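The 8000-messages-or-40ms batching visible in the logs above can be sketched as follows (a Python illustration of the idea; the plugin itself is written in C, and all names here are made up):

```python
import time

def process_msgs(queue, max_msgs=8000, max_usecs=40000):
    """Drain up to max_msgs messages, or until max_usecs of wall time have
    elapsed, whichever comes first, then yield back to the caller (VPP)."""
    start = time.monotonic()
    processed = 0
    while queue and processed < max_msgs:
        if (time.monotonic() - start) * 1_000_000 >= max_usecs:
            break
        queue.popleft()      # stand-in for handling one Netlink message
        processed += 1
    return processed, len(queue)
```

Called in a loop, this drains a 100K-message backlog in batches, while the dataplane keeps forwarding between calls.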
## Results
The functional regression test I made on day one, the one that ensures end-to-end connectivity to and
from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged
and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`),
still works. Great.
Here's [a screencast](https://asciinema.org/a/432943) showing me playing around a bit with that configuration
shown above, demonstrating that RIB and FIB synchronisation works pretty well in both directions, making the
combination of these two plugins sufficient to run a VPP router in the _Default Free Zone_. Whoohoo!
<script id="asciicast-432943" src="https://asciinema.org/a/432943.js" async></script>
### Future work
**Atomic Updates** - When running VPP + Linux CP in a default free zone BGP environment,
IPv4 and IPv6 prefixes will be constantly updated as the internet topology morphs and changes.
One thing I noticed is that those are often deletes followed by adds with the exact same
nexthop (i.e. something in Germany flapped, and this is not deduplicated), which shows up
as many pairs of messages like so:
```
linux-cp/nl [debug ]: route_del: netlink route/route: del family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 }
linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, []
linux-cp/nl [info ]: route_del: table 254 prefix 2a10:cc40:b03::/48 flags
linux-cp/nl [debug ]: route_add: netlink route/route: add family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 }
linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, []
linux-cp/nl [info ]: route_add: table 254 prefix 2a10:cc40:b03::/48 flags
linux-cp/nl [info ]: process_msgs: Processed 2 messages in 225 usecs
```
See how `2a10:cc40:b03::/48` is first removed, and then immediately reinstalled with the exact same
nexthop `fe80::9e69:b4ff:fe61:a1dd` on interface `TenGigabitEthernet3/0/1.3`? Although this only takes
225&micro;s, it's still a bit sad to parse and create paths, only to remove the prefix from the FIB and
re-insert the exact same thing. More importantly, if a packet destined for this prefix arrives in that
225&micro;s window, it will be lost. So I think I'll build a peek-ahead mechanism to capture specifically
this occurrence, and let the two del+add messages cancel each other out.
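A sketch of that cancellation, with messages modeled as `(op, prefix, nexthop)` tuples (illustrative Python; the real queue holds Netlink messages and the plugin is written in C):

```python
def coalesce(msgs):
    """Peek ahead one message: a del immediately followed by an add for the
    same prefix and nexthop is a no-op pair, so drop both."""
    out = []
    i = 0
    while i < len(msgs):
        if (i + 1 < len(msgs)
                and msgs[i][0] == 'del' and msgs[i + 1][0] == 'add'
                and msgs[i][1:] == msgs[i + 1][1:]):
            i += 2  # cancel the pair; the FIB never sees the transient removal
            continue
        out.append(msgs[i])
        i += 1
    return out
```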
**Prefix updates towards lo** - When writing the code, I borrowed a bunch from the
pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)] but that one has a nasty crash which was hard to
debug and which I haven't yet fully understood. An add/del can occur for a route towards IPv6
localhost: these are typically seen when Bird shuts down eBGP sessions and no longer has a path
to a prefix, in which case it marks the prefix 'unreachable' rather than deleting it. Such routes
arrive as *additions* which have a nexthop without a gateway but with an interface index of 1
(which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I have commented it out for
now, until I understand it better. Result: blackhole/unreachable/prohibit
specials cannot be set using the plugin. Beware!
(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
## Credits
I'd like to make clear that the Linux CP plugin is a collaboration between several great minds,
and that my work stands on other software engineers' shoulders. In particular, most of the Netlink
socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit
of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work!
## Appendix
#### VPP config
We only use one TenGigabitEthernet device on the router, and create two VLANs on it:
```
IP="sudo ip netns exec dataplane ip"
vppctl set logging class linux-cp rate-limit 1000 level warn syslog-level notice
vppctl lcp create TenGigabitEthernet3/0/1 host-if e1 netns dataplane
$IP link set e1 mtu 1500 up
$IP link add link e1 name ixp type vlan id 179
$IP link set ixp mtu 1500 up
$IP addr add 45.129.224.235/29 dev ixp
$IP addr add 2a0e:5040:0:2::235/64 dev ixp
$IP link add link e1 name servers type vlan id 101
$IP link set servers mtu 1500 up
$IP addr add 194.1.163.86/27 dev servers
$IP addr add 2001:678:d78:3::86/64 dev servers
```
#### Bird config
I'm using a purposefully minimalist configuration for demonstration purposes, posted here
in full for posterity:
```
log syslog all;
log "/var/log/bird/bird.log" { debug, trace, info, remote, warning, error, auth, fatal, bug };
router id 194.1.163.86;
protocol device { scan time 10; }
protocol direct { ipv4; ipv6; check link yes; }
protocol kernel kernel4 { ipv4 { import all; export where source != RTS_DEVICE; }; }
protocol kernel kernel6 { ipv6 { import all; export where source != RTS_DEVICE; }; }
protocol ospf v2 ospf4 {
ipv4 { export where source = RTS_DEVICE; import all; };
area 0 {
interface "lo" { stub yes; };
interface "servers" { type broadcast; cost 5; };
};
}
protocol ospf v3 ospf6 {
ipv6 { export where source = RTS_DEVICE; import all; };
area 0 {
interface "lo" { stub yes; };
interface "servers" { type broadcast; cost 5; };
};
}
template bgp T_IBGP4 {
local as 50869;
neighbor as 50869;
source address 194.1.163.86;
ipv4 { import all; export none; next hop self on; };
};
protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; }
protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; }
protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; }
template bgp T_IBGP6 {
local as 50869;
neighbor as 50869;
source address 2001:678:d78:3::86;
ipv6 { import all; export none; next hop self ibgp; };
};
protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; }
protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; }
protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; }
```
#### Final note
You may have noticed that the [commit] links are all to git commits in my private working copy. I
want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
mean time :-)
---
date: "2021-09-10T13:21:14Z"
title: VPP Linux CP - Part6
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
## SNMP in VPP
Now that the **Interface Mirror** and **Netlink Listener** plugins are in good shape, this post
shows a few finishing touches. First off, although the native habitat of VPP is [Prometheus](https://prometheus.io/),
many folks still run classic network monitoring systems like the popular [Observium](https://observium.org/)
or its sibling [LibreNMS](https://librenms.org/). Although the metrics-based approach is modern,
we really ought to have an old-skool [SNMP](https://datatracker.ietf.org/doc/html/rfc1157) interface
so that we can swear it _by the Old Gods and the New_.
### VPP's Stats Segment
VPP maintains lots of interesting statistics at runtime - for example for nodes and their activity,
but also, importantly, for each interface known to the system. So I take a look at the stats segment,
configured in `startup.conf`, and I notice that VPP creates a socket at `/run/vpp/stats.sock` which
can be connected to. There are also a few introspection tools, notably `vpp_get_stats`, which can either
list, dump once, or continuously dump the data:
```
pim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock ls | wc -l
3800
pim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock dump /if/names
[0]: local0 /if/names
[1]: TenGigabitEthernet3/0/0 /if/names
[2]: TenGigabitEthernet3/0/1 /if/names
[3]: TenGigabitEthernet3/0/2 /if/names
[4]: TenGigabitEthernet3/0/3 /if/names
[5]: GigabitEthernet5/0/0 /if/names
[6]: GigabitEthernet5/0/1 /if/names
[7]: GigabitEthernet5/0/2 /if/names
[8]: GigabitEthernet5/0/3 /if/names
[9]: TwentyFiveGigabitEthernet11/0/0 /if/names
[10]: TwentyFiveGigabitEthernet11/0/1 /if/names
[11]: tap2 /if/names
[12]: TenGigabitEthernet3/0/1.1 /if/names
[13]: tap2.1 /if/names
[14]: TenGigabitEthernet3/0/1.2 /if/names
[15]: tap2.2 /if/names
[16]: TenGigabitEthernet3/0/1.3 /if/names
[17]: tap2.3 /if/names
[18]: tap3 /if/names
[19]: tap4 /if/names
```
Alright! Clearly, the `/if/` prefix is the one I'm looking for. I find a Python library that allows
for this data to be MMAPd and directly read as a dictionary, including some neat aggregation
functions (see `src/vpp-api/python/vpp_papi/vpp_stats.py`):
```
Counters can be accessed in either dimension.
stat['/if/rx'] - returns 2D lists
stat['/if/rx'][0] - returns counters for all interfaces for thread 0
stat['/if/rx'][0][1] - returns counter for interface 1 on thread 0
stat['/if/rx'][0][1]['packets'] - returns the packet counter
for interface 1 on thread 0
stat['/if/rx'][:, 1] - returns the counters for interface 1 on all threads
stat['/if/rx'][:, 1].packets() - returns the packet counters for
interface 1 on all threads
stat['/if/rx'][:, 1].sum_packets() - returns the sum of packet counters for
interface 1 on all threads
stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
interface 1 on all threads for simple counters
```
Alright, so let's grab that file and refactor it into a small library for my own use; I do
this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
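To get a feel for that access pattern without a running VPP, here's a tiny sketch that mimics the shape of the `/if/rx` combined counters with plain nested lists (all names and numbers are made up for illustration):

```python
# Shape mimics stat['/if/rx']: the outer list is per-thread, the inner list is
# per-interface, and each cell is a combined counter of packets and bytes.
if_rx = [
    [{'packets': 10, 'bytes': 1000}, {'packets': 5, 'bytes': 500}],  # thread 0
    [{'packets': 7,  'bytes': 700},  {'packets': 3, 'bytes': 300}],  # thread 1
]

def sum_packets(rows, sw_if_index):
    """Aggregate the packet counter for one interface across all threads,
    much like stat['/if/rx'][:, sw_if_index].sum_packets() does."""
    return sum(thread[sw_if_index]['packets'] for thread in rows)
```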
### VPP's API
In a previous project, I already got a little bit of exposure to the Python API (`vpp_papi`),
and it's pretty straightforward to use. Each API is published in a JSON file in
`/usr/share/vpp/api/{core,plugins}/` and those can be read by the Python library and
exposed to callers. This gives me full programmatic read/write access to the VPP runtime
configuration, which is super cool.
There are dozens of APIs to call (the Linux CP plugin even added one!), and in the case
of enumerating interfaces, we can see the definition in `core/interface.api.json` where
there is an element called `services.sw_interface_dump` which shows its reply is
`sw_interface_details`, and in that message we can see all the fields that will be
set in the request and all that will be present in the response. Nice! Here's a quick
demonstration:
```python
from vpp_papi import VPPApiClient
import os
import fnmatch
import sys
vpp_json_dir = '/usr/share/vpp/api/'
# construct a list of all the json api files
jsonfiles = []
for root, dirnames, filenames in os.walk(vpp_json_dir):
for filename in fnmatch.filter(filenames, '*.api.json'):
jsonfiles.append(os.path.join(root, filename))
vpp = VPPApiClient(apifiles=jsonfiles, server_address='/run/vpp/api.sock')
vpp.connect("test-client")
v = vpp.api.show_version()
print('VPP version is %s' % v.version)
iface_list = vpp.api.sw_interface_dump()
for iface in iface_list:
print("idx=%d name=%s mac=%s mtu=%d flags=%d" % (iface.sw_if_index,
iface.interface_name, iface.l2_address, iface.mtu[0], iface.flags))
```
The output:
```
$ python3 vppapi-test.py
VPP version is 21.10-rc0~325-g4976c3b72
idx=0 name=local0 mac=00:00:00:00:00:00 mtu=0 flags=0
idx=1 name=TenGigabitEthernet3/0/0 mac=68:05:ca:32:46:14 mtu=9000 flags=0
idx=2 name=TenGigabitEthernet3/0/1 mac=68:05:ca:32:46:15 mtu=1500 flags=3
idx=3 name=TenGigabitEthernet3/0/2 mac=68:05:ca:32:46:16 mtu=9000 flags=1
idx=4 name=TenGigabitEthernet3/0/3 mac=68:05:ca:32:46:17 mtu=9000 flags=1
idx=5 name=GigabitEthernet5/0/0 mac=a0:36:9f:c8:a0:54 mtu=9000 flags=0
idx=6 name=GigabitEthernet5/0/1 mac=a0:36:9f:c8:a0:55 mtu=9000 flags=0
idx=7 name=GigabitEthernet5/0/2 mac=a0:36:9f:c8:a0:56 mtu=9000 flags=0
idx=8 name=GigabitEthernet5/0/3 mac=a0:36:9f:c8:a0:57 mtu=9000 flags=0
idx=9 name=TwentyFiveGigabitEthernet11/0/0 mac=6c:b3:11:20:e0:c4 mtu=9000 flags=0
idx=10 name=TwentyFiveGigabitEthernet11/0/1 mac=6c:b3:11:20:e0:c6 mtu=9000 flags=0
idx=11 name=tap2 mac=02:fe:07:ae:31:c3 mtu=1500 flags=3
idx=12 name=TenGigabitEthernet3/0/1.1 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=13 name=tap2.1 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=14 name=TenGigabitEthernet3/0/1.2 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=15 name=tap2.2 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=16 name=TenGigabitEthernet3/0/1.3 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=17 name=tap2.3 mac=00:00:00:00:00:00 mtu=1500 flags=3
idx=18 name=tap3 mac=02:fe:95:db:3f:c4 mtu=9000 flags=3
idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3
```
So I added a little abstraction with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
### AgentX
Now that we are able to enumerate the interfaces and their metadata (like admin/oper
status, link speed, name, index, MAC address, and what have you), as well as the highly
sought after interface statistics as 64bit counters (with a wealth of extra information
like broadcast/multicast/unicast packets, octets received and transmitted, errors and
drops), I am ready to tie things together.
It took a bit of sleuthing, but I eventually found a library on sourceforge (!)
that has a rudimentary implementation of [RFC 2741](https://datatracker.ietf.org/doc/html/rfc2741)
which is the SNMP Agent Extensibility (AgentX) Protocol. In a nutshell, this allows
an external program to connect to the main SNMP daemon, register an interest in
certain OIDs, and get called whenever the SNMPd is being queried for them.
The flow is pretty simple (see section 6.2 of the RFC), the Agent (client):
1. opens a TCP or Unix domain socket to the SNMPd
1. sends an Open PDU, which the server will respond or reject.
1. (optionally) can send a Ping PDU, the server will respond.
1. registers an interest with Register PDU
It then waits and gets called by the SNMPd with Get PDUs (to retrieve one
single value), GetNext PDU (to enable snmpwalk), GetBulk PDU (to retrieve a whole
subsection of the MIB), all of which are answered by a Response PDU.
If the Agent is to support writing, it will also have to implement TestSet, CommitSet,
CommitUndoSet and CommitCleanupSet PDUs. For this agent, we don't need to implement
those, so I'll just ignore those requests and implement the read-only stuff. Sounds easy :)
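As a taste of the protocol: every AgentX PDU starts with a fixed 20-byte header (RFC 2741, section 6.1). A minimal sketch of encoding one for an Open-PDU (the constant names here are mine):

```python
import struct

AGENTX_VERSION = 1
PDU_OPEN       = 1     # agentx-Open-PDU
FLAG_NET_ORDER = 0x10  # NETWORK_BYTE_ORDER bit in h.flags

def pdu_header(pdu_type, session_id, transaction_id, packet_id, payload_len):
    """h.version, h.type, h.flags, one reserved byte, then four 32-bit
    fields: sessionID, transactionID, packetID and payload_length."""
    return struct.pack('!BBBBIIII', AGENTX_VERSION, pdu_type, FLAG_NET_ORDER,
                       0, session_id, transaction_id, packet_id, payload_len)
```

The real Open-PDU then carries a timeout, an agent OID and a description string in its payload; this only shows the common header.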
The first order of business is to create the values for two main MIBs of interest:
1. `.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.` - This table is an older variant
and it contains a bunch of relevant fields, one per interface, notably `ifIndex`,
`ifName`, `ifType`, `ifMtu`, `ifSpeed`, `ifPhysAddress`, `ifOperStatus`, `ifAdminStatus`
and a bunch of 32bit counters for octets/packets in and out of the interfaces.
1. `.iso.org.dod.internet.mgmt.mib-2.ifMIB.ifMIBObjects.ifXTable.` - This table is a makeover
of the other one (the **X** here stands for eXtra), and adds a few 64 bit counters
for the interface stats, as well as an `ifHighSpeed` which is in megabits per second
instead of the bits per second of `ifSpeed` in the previous MIB.
Populating these MIBs can be done periodically by retrieving the interfaces from VPP and
then simply walking the dictionary with Stats Segment data. I then register these two
main MIB entrypoints with SNMPd as I connect to it, and spit out the correct values
once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `ResponsePDU`
to the SNMP server -- it takes care of all the rest!
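To illustrate the GetNext handling that makes `snmpwalk` work: keep the served OIDs sorted, and answer with the first entry lexicographically after the requested one (the values below are made up):

```python
from bisect import bisect_right

# OIDs as integer tuples; 1.3.6.1.2.1.31.1.1.1.1.<ifIndex> is ifXTable ifName.
mib = {
    (1, 3, 6, 1, 2, 1, 31, 1, 1, 1, 1, 1): 'TenGigabitEthernet3/0/1',
    (1, 3, 6, 1, 2, 1, 31, 1, 1, 1, 1, 2): 'tap2',
}
oids = sorted(mib)

def get_next(oid):
    """Answer a GetNext PDU: the first (oid, value) strictly after `oid`."""
    i = bisect_right(oids, oid)
    return (oids[i], mib[oids[i]]) if i < len(oids) else None
```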
The resulting code is in [[this
commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
### Building
Shipping a bunch of Python files around is not ideal, so I decide to bundle all of this
into a single binary that I can easily distribute to my machines: I simply install
`pyinstaller` with _PIP_ and run it:
```
sudo pip install pyinstaller
pyinstaller vpp-snmp-agent.py --onefile
## Run it on console
dist/vpp-snmp-agent -h
usage: vpp-snmp-agent [-h] [-a ADDRESS] [-p PERIOD] [-d]
optional arguments:
-h, --help show this help message and exit
-a ADDRESS Location of the SNMPd agent (unix-path or host:port), default localhost:705
-p PERIOD Period to poll VPP, default 30 (seconds)
-d Enable debug, default False
## Install
sudo cp dist/vpp-snmp-agent /usr/sbin/
```
### Running
After installing `Net-SNMP`, the default in Ubuntu, I do have to ensure that it runs in
the correct namespace. So what I do is disable the systemd unit that ships with the Ubuntu
package, and instead create these:
```
pim@hippo:~/src/vpp-snmp-agentx$ cat << EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service
[Unit]
Description=Dataplane network namespace
After=systemd-sysctl.service network-pre.target
Before=network.target network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
# PrivateNetwork will create network namespace which can be
# used in JoinsNamespaceOf=.
PrivateNetwork=yes
# To set `ip netns` name for this namespace, we create a second namespace
# with required name, unmount it, and then bind our PrivateNetwork
# namespace to it. After this we can use our PrivateNetwork as a named
# namespace in `ip netns` commands.
ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace"
ExecStart=-/usr/sbin/ip netns delete dataplane
ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane
ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf
ExecStart=-/usr/sbin/ip netns add dataplane
ExecStart=-/usr/bin/umount /var/run/netns/dataplane
ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane
# Apply default sysctl for dataplane namespace
ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl
ExecStop=-/usr/sbin/ip netns delete dataplane
[Install]
WantedBy=multi-user.target
WantedBy=network-online.target
EOF
pim@hippo:~/src/vpp-snmp-agentx$ cat << EOF | sudo tee /usr/lib/systemd/system/snmpd-dataplane.service
[Unit]
Description=Simple Network Management Protocol (SNMP) Daemon.
After=network.target
ConditionPathExists=/etc/snmp/snmpd.conf
[Service]
Type=simple
ExecStartPre=/bin/mkdir -p /var/run/agentx-dataplane/
NetworkNamespacePath=/var/run/netns/dataplane
ExecStart=/usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid
ExecReload=/bin/kill -HUP \$MAINPID
[Install]
WantedBy=multi-user.target
EOF
pim@hippo:~/src/vpp-snmp-agentx$ cat << EOF | sudo tee /usr/lib/systemd/system/vpp-snmp-agent.service
[Unit]
Description=SNMP AgentX Daemon for VPP dataplane statistics
After=network.target
ConditionPathExists=/etc/snmp/snmpd.conf
[Service]
Type=simple
NetworkNamespacePath=/var/run/netns/dataplane
ExecStart=/usr/sbin/vpp-snmp-agent
Group=vpp
ExecReload=/bin/kill -HUP \$MAINPID
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
```
Note the use of `NetworkNamespacePath` here -- this ensures that the snmpd and its agent both
run in the `dataplane` namespace which was created by `netns-dataplane.service`.
## Results
I now install the binary and, using the `snmpd.conf` configuration file (see Appendix), start everything up:
```
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl stop snmpd
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl disable snmpd
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl daemon-reload
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable netns-dataplane
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start netns-dataplane
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable snmpd-dataplane
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start snmpd-dataplane
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable vpp-snmp-agent
pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start vpp-snmp-agent
pim@hippo:~/src/vpp-snmp-agentx$ sudo journalctl -u vpp-snmp-agent
[INFO ] agentx.agent - run : Calling setup
[INFO ] agentx.agent - setup : Connecting to VPP Stats...
[INFO ] agentx.vppapi - connect : Connecting to VPP
[INFO ] agentx.vppapi - connect : VPP version is 21.10-rc0~325-g4976c3b72
[INFO ] agentx.agent - run : Initial update
[INFO ] agentx.network - update : Setting initial serving dataset (740 OIDs)
[INFO ] agentx.agent - run : Opening AgentX connection
[INFO ] agentx.network - connect : Connecting to localhost:705
[INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.2.2.1
[INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.31.1.1.1
[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
```
{{< image width="800px" src="/assets/vpp/librenms.png" alt="LibreNMS" >}}
## Appendix
#### SNMPd Config
```
$ cat << EOF | sudo tee /etc/snmp/snmpd.conf
com2sec readonly default <<some-string>>
group MyROGroup v2c readonly
view all included .1 80
access MyROGroup "" any noauth exact all none none
sysLocation Ruemlang, Zurich, Switzerland
sysContact noc@ipng.ch
master agentx
agentXSocket tcp:localhost:705,unix:/var/agentx/master,unix:/run/vpp/agentx.sock
agentaddress udp:161,udp6:161
#OS Distribution Detection
extend distro /usr/bin/distro
#Hardware Detection
extend manufacturer '/bin/cat /sys/devices/virtual/dmi/id/sys_vendor'
extend hardware '/bin/cat /sys/devices/virtual/dmi/id/product_name'
extend serial '/bin/cat /var/run/snmpd.serial'
EOF
```
Note the use of a few helpers here - `/usr/bin/distro` comes from LibreNMS [ref](https://docs.librenms.org/Support/SNMP-Configuration-Examples/)
and tries to figure out which distribution is used. The very last line of that file
echoes the detected distribution, to which I prepend a string, like `echo "VPP ${OSSTR}"`.
The other file of interest `/var/run/snmpd.serial` is computed at boot-time, by running
the following in `/etc/rc.local`:
```
# Assemble serial number for snmpd
BS=$(cat /sys/devices/virtual/dmi/id/board_serial)
PS=$(cat /sys/devices/virtual/dmi/id/product_serial)
echo $BS.$PS > /var/run/snmpd.serial
```
I have to do this because SNMPd runs as a non-privileged user, yet those DMI elements are
root-readable only (for reasons that are beyond me). Seeing as they will not change at
runtime anyway, I just create that file and cat it into the `serial` field. It then shows
up nicely in LibreNMS alongside the others.
{{< image width="200px" float="left" src="/assets/vpp/vpp.png" alt="VPP Hound" >}}
Oh, and one last thing. The VPP Hound logo!
In LibreNMS, the icons in the _devices_ view use a function that leverages this `distro`
field, by looking at the first word (in our case "VPP") with an extension of either .svg
or .png in an icons directory, usually `html/images/os/`. I dropped the hound from the
[fd.io](https://fd.io/) homepage in there, and will add the icon upstream for future use,
in this [[librenms PR](https://github.com/librenms/librenms/pull/13230)] and its companion
change in [[librenms-agent PR](https://github.com/librenms/librenms-agent/pull/374)].
---
date: "2021-09-21T00:41:14Z"
title: VPP Linux CP - Part7
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
## Running in Production
In the first articles from this series, I showed the code that needed to be written to implement the
**Control Plane** and **Netlink Listener** plugins. In the [penultimate post]({% post_url 2021-09-10-vpp-6 %}),
I wrote an SNMP Agentx that exposes the VPP interface data to, say, LibreNMS.
But what are the things one might do to deploy a router end-to-end? That is the topic of this post.
### A note on hardware
Before I get into the details, here's some specifications on the router hardware that I use at
IPng Networks (AS50869). See more about our network [here]({% post_url 2021-02-27-network %}).
The chassis is a Supermicro SYS-5018D-FN8T, which includes:
* Full IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port.
* A 4-core, 8-thread Xeon D1518 CPU which runs at 35W
* Two independent Intel i210 NICs (Gigabit)
* A Quad Intel i350 NIC (Gigabit)
* Two Intel X552 (TenGig)
* (optional) One Intel X710 Quad-TenGig NIC in the expansion bus
* m.SATA 120G boot SSD
* 2x16GB of ECC RAM
The only downside of this machine is that it has only one power supply, so datacenters which do
periodic feed maintenance (as Interxion is known to do) are likely to reboot the machine
from time to time. However, the machine is very well spec'd for VPP in "low" performance scenarios.
A machine like this is very affordable (I bought the chassis for about USD 800,- a piece) but its
CPU/Memory/PCIe construction is enough to provide forwarding at approximately 35Mpps.
Doing a lazy 1Mpps on this machine's Xeon D1518, VPP comes in at ~660 clocks per packet with a vector
length of ~3.49. This means that if I dedicate 3 cores running at 2200MHz to VPP (leaving 1C2T for
the controlplane), this machine has a forwarding capacity of ~34.7Mpps, which fits really well with
the Intel X710 NICs (which are limited to 40Mpps [[ref](https://trex-tgn.cisco.com/trex/doc/trex_faq.html)]).
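As a back-of-the-envelope check (an assumption on my part: the ~34.7Mpps figure works out if the ~660 clocks are taken per vector call, i.e. roughly 660/3.49 &asymp; 189 clocks per packet):

```python
clocks_per_vector = 660           # observed clocks, read as per vector call
vector_length = 3.49              # average packets per vector
clock_hz = 2_200_000_000          # 2200 MHz per core
cores = 3                         # worker cores dedicated to VPP

clocks_per_packet = clocks_per_vector / vector_length   # ~189 clocks/packet
total_pps = cores * clock_hz / clocks_per_packet        # ~34.9 Mpps
print(f"{total_pps / 1e6:.1f} Mpps")
```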
A reasonable step-up from here would be Supermicro's SIS810 with a Xeon E-2288G (8 cores / 16 threads)
which carries a dual-PSU, up to 8x Intel i210 NICs and 2x Intel X710 Quad-Tengigs, but it's quite
a bit more expensive. I'll commit to doing that the day AS50869 forwards 10Mpps in practice :-)
### Install HOWTO
First, I install the "canonical" (pun intended) operating system that VPP is most comfortable running
on: Ubuntu 20.04.3. I select nothing special when installing, and after the install is done, I make sure
that GRUB uses the serial IPMI port by adding to `/etc/default/grub`:
```
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7"
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
# Followed by a gratuitous install and update
grub-install /dev/sda
update-grub
```
Note that the `isolcpus` is a neat trick that tells the Linux task scheduler to avoid scheduling any
workloads on those CPUs. Because the Xeon-D1518 has 4 cores (0,1,2,3) and 4 additional hyperthreads
(4,5,6,7), this stanza effectively makes cores 1,2,3 unavailable to Linux, leaving only core 0 and its
hyperthread 4 available. This means that our controlplane will have 2 CPUs available to run things
like Bird, SNMP, SSH and so on, while hyperthreading is essentially turned off on cores 1,2,3, giving
those cores entirely to VPP.
In case you were wondering why I would turn off hyperthreading in this way: hyperthreads share
CPU instruction and data cache. The premise of VPP is that a `vector` (a list) of packets will
go through the same routines (like `ethernet-input` or `ip4-lookup`) all at once. In such a
computational model, VPP leverages the i-cache and d-cache to have subsequent packets make use
of the warmed up cache from their predecessor, without having to use the (much slower, relatively
speaking) main memory.
The last thing you'd want, is for the hyperthread to come along and replace the cache contents with
what-ever it's doing (be it Linux tasks, or another VPP thread).
So: disallowing scheduling on cores 1,2,3 and their counterpart hyperthreads 5,6,7, AND constraining
VPP to run only on lcore 1,2,3 will essentially maximize the CPU cache hitrate for VPP, greatly
improving performance.
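The `isolcpus` boot parameter pairs with a matching `cpu` stanza in VPP's `startup.conf` (typically `/etc/vpp/startup.conf`); a sketch of what that might look like for this core layout:

```
cpu {
  main-core 0
  corelist-workers 1-3
}
```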
### Network Namespace
Originally proposed by [TNSR](https://netgate.com/tnsr), a Netgate commercial productionization of
VPP, it's a good idea to run VPP and its controlplane in a separate Linux network
[namespace](https://man7.org/linux/man-pages/man8/ip-netns.8.html). A network namespace is
logically another copy of the network stack, with its own routes, firewall rules, and network
devices.
Creating a namespace looks like follows, on a machine running `systemd`, like Ubuntu or Debian:
```
cat << EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service
[Unit]
Description=Dataplane network namespace
After=systemd-sysctl.service network-pre.target
Before=network.target network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
# PrivateNetwork will create network namespace which can be
# used in JoinsNamespaceOf=.
PrivateNetwork=yes
# To set `ip netns` name for this namespace, we create a second namespace
# with required name, unmount it, and then bind our PrivateNetwork
# namespace to it. After this we can use our PrivateNetwork as a named
# namespace in `ip netns` commands.
ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace"
ExecStart=-/usr/sbin/ip netns delete dataplane
ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane
ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf
ExecStart=-/usr/sbin/ip netns add dataplane
ExecStart=-/usr/bin/umount /var/run/netns/dataplane
ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane
# Apply default sysctl for dataplane namespace
ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl
ExecStop=-/usr/sbin/ip netns delete dataplane
[Install]
WantedBy=multi-user.target
WantedBy=network-online.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable netns-dataplane
sudo systemctl start netns-dataplane
```
Now, every time we reboot the system, a new network namespace will exist with the
name `dataplane`. That's where you've seen me create interfaces in my previous posts,
and that's where our life-as-a-VPP-router will be born.
### Preparing the machine
After creating the namespace, I'll install a bunch of useful packages to further prepare
the machine, and also remove a few packages that were installed out of the box:
```
## Remove what we don't need
sudo apt purge cloud-init snapd
## Usual tools for Linux
sudo apt install rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 lm-sensors
## And for VPP
sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \
libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0
## Disable Bird and SNMPd because they will be running in another namespace
for i in bird snmpd; do
sudo systemctl stop $i
sudo systemctl disable $i
sudo systemctl mask $i
done
# Ensure all temp/fan sensors are detected
sudo sensors-detect --auto
```
### Installing VPP
After [building](https://fdio-vpp.readthedocs.io/en/latest/gettingstarted/developers/building.html)
the code, specifically after issuing a successful `make pkg-deb`, a set of Debian packages
will be in the `build-root` sub-directory. Take these and install them like so:
```
## Install VPP
sudo mkdir -p /var/log/vpp/
sudo dpkg -i *.deb
## Reserve 6GB (3072 x 2MB) of memory for hugepages
cat << EOF | sudo tee /etc/sysctl.d/80-vpp.conf
vm.nr_hugepages=3072
vm.max_map_count=7168
vm.hugetlb_shm_group=0
kernel.shmmax=6442450944
EOF
## Set 64MB netlink buffer size
cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf
net.core.rmem_default=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.wmem_max=67108864
EOF
## Apply these sysctl settings
sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf
sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf
## Add user to relevant groups
sudo adduser $USER bird
sudo adduser $USER vpp
```
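As a sanity check on those sysctl values, the numbers follow directly from the sizes mentioned
in the comments (an illustrative calculation, not part of the install):

```python
# Hugepage reservation: 3072 pages of 2 MiB each
nr_hugepages = 3072
hugepage_size = 2 * 1024 * 1024          # 2 MiB per hugepage
print(nr_hugepages * hugepage_size)      # 6442450944 bytes = 6 GiB, matches kernel.shmmax

# Netlink buffer size: 64 MiB
print(64 * 1024 * 1024)                  # 67108864, matches net.core.{r,w}mem_*
```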
Next up, I make a backup of the original, and then create a reasonable startup configuration
for VPP:
```
## Create suitable startup configuration for VPP
cd /etc/vpp
sudo cp startup.conf startup.conf.orig
cat << EOF | sudo tee startup.conf
unix {
nodaemon
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
gid vpp
exec /etc/vpp/bootstrap.vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
memory {
main-heap-size 1536M
main-heap-page-size default-hugepage
}
cpu {
main-core 0
workers 3
}
buffers {
buffers-per-numa 128000
default data-size 2048
page-size default-hugepage
}
statseg {
size 1G
page-size default-hugepage
per-node-counters off
}
plugins {
plugin lcpng_nl_plugin.so { enable }
plugin lcpng_if_plugin.so { enable }
}
logging {
default-log-level info
default-syslog-log-level notice
}
EOF
```
A few notes specific to my hardware configuration:
* the `cpu` stanza says to run the main thread on CPU 0, and then run three workers (on
CPU 1,2,3; the ones for which I disabled the Linux scheduler by means of `isolcpus`). So
CPU 0 and its hyperthread CPU 4 are available for Linux to schedule on, while there are
three full cores dedicated to forwarding. This will ensure very low latency/jitter and
predictably high throughput!
* HugePages are a memory optimization mechanism in Linux. For virtual memory management,
  the kernel maintains a table mapping virtual memory addresses to physical addresses,
  and every memory access may require the related mapping to be loaded. With small pages,
  the same amount of memory needs many more pages, and thus many more mapping entries,
  which decreases performance. I set the page size to a larger 2MB (the default is 4KB),
  reducing the number of mappings to load and thereby considerably improving performance.
* I need to ensure there's enough _Stats Segment_ memory available - each worker thread
keeps counters of each prefix, and with a full BGP table (weighing in at 1M prefixes
in Q3'21), the amount of memory needed is substantial. Similarly, I need to ensure there
are sufficient _Buffers_ available.
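To put some numbers on that, here's a back-of-the-envelope estimate of the buffer memory,
assuming the `buffers { }` values from the startup configuration above:

```python
# Each buffer holds data-size bytes; buffers-per-numa of them are pre-allocated.
buffers_per_numa = 128000
data_size = 2048                          # bytes per buffer
total = buffers_per_numa * data_size
print(total, total / (1024 * 1024))       # 262144000 bytes, i.e. 250.0 MiB per NUMA node
```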
Finally, observe the stanza `unix { exec /etc/vpp/bootstrap.vpp }`: this is a way for me to
tell VPP to run a set of CLI commands as soon as it starts. It ensures that if VPP were to
crash, or the machine were to reboot (more likely :-), VPP will start up with a working
interface and IP address configuration, and any other things I might want VPP to do (like
bridge-domains).
A note on VPP's binding of interfaces: by default, VPP's `dpdk` driver will acquire any interface
from Linux that is not in use (which means: any interface that is admin-down/unconfigured).
To make sure that VPP gets all interfaces, I will remove `/etc/netplan/*` (or in Debian's case,
`/etc/network/interfaces`). This is why Supermicro's KVM and serial-over-lan are so valuable, as
they allow me to log in and deconfigure the entire machine, in order to yield all interfaces
to VPP. They also allow me to reinstall or switch from DANOS to Ubuntu+VPP on a server that's
700km away.
Anyway, I can start VPP simply like so:
```
sudo rm -f /etc/netplan/*
sudo rm -f /etc/network/interfaces
## Set any link to down, or reboot the machine and access over KVM or Serial
sudo systemctl restart vpp
vppctl show interface
```
See all interfaces? Great. Moving on :)
### Configuring VPP
I set a VPP interface configuration (which it'll read and apply any time it starts or restarts,
thereby making the configuration persistent across crashes and reboots). Using the `exec`
stanza described above, the contents now become the following, taking as an example our first
router in Lille, France [[details]({% post_url 2021-05-28-lille %})]:
```
cat << EOF | sudo tee /etc/vpp/bootstrap.vpp
set logging class linux-cp rate-limit 1000 level warn syslog-level notice
lcp default netns dataplane
lcp lcp-sync on
lcp lcp-auto-subint on
create loopback interface instance 0
lcp create loop0 host-if loop0
set interface state loop0 up
set interface ip address loop0 194.1.163.34/32
set interface ip address loop0 2001:678:d78::a/128
lcp create TenGigabitEthernet4/0/0 host-if xe0-0
lcp create TenGigabitEthernet4/0/1 host-if xe0-1
lcp create TenGigabitEthernet6/0/0 host-if xe1-0
lcp create TenGigabitEthernet6/0/1 host-if xe1-1
lcp create TenGigabitEthernet6/0/2 host-if xe1-2
lcp create TenGigabitEthernet6/0/3 host-if xe1-3
lcp create GigabitEthernetb/0/0 host-if e1-0
lcp create GigabitEthernetb/0/1 host-if e1-1
lcp create GigabitEthernetb/0/2 host-if e1-2
lcp create GigabitEthernetb/0/3 host-if e1-3
EOF
```
This base-line configuration will:
* Ensure all host interfaces are created in namespace `dataplane` which we created earlier
* Turn on `lcp-sync`, which copies forward any configuration from VPP into Linux (see
[VPP Part 2]({% post_url 2021-08-13-vpp-2 %}))
* Turn on `lcp-auto-subint`, which automatically creates _LIPs_ (Linux interface pairs)
for all sub-interfaces (see [VPP Part 3]({% post_url 2021-08-15-vpp-3 %}))
* Create a loopback interface, give it IPv4/IPv6 addresses, and expose it to Linux
* Create one _LIP_ interface for four of the Gigabit and all 6x TenGigabit interfaces
* Leave 2 interfaces (`GigabitEthernet7/0/0` and `GigabitEthernet8/0/0`) for later
Further, sub-interfaces and bridge-groups might be configured as such:
```
comment { Infra: er01.lil01.ip-max.net Te0/0/0/6 }
set interface mtu packet 9216 TenGigabitEthernet6/0/2
set interface state TenGigabitEthernet6/0/2 up
create sub TenGigabitEthernet6/0/2 100
set interface mtu packet 9000 TenGigabitEthernet6/0/2.100
set interface state TenGigabitEthernet6/0/2.100 up
set interface ip address TenGigabitEthernet6/0/2.100 194.1.163.30/31
set interface unnumbered TenGigabitEthernet6/0/2.100 use loop0
comment { Infra: Bridge Domain for mgmt }
create bridge-domain 1
create loopback interface instance 1
lcp create loop1 host-if bvi1
set interface ip address loop1 192.168.0.81/29
set interface ip address loop1 2001:678:d78::1:a:1/112
set interface l2 bridge loop1 1 bvi
set interface l2 bridge GigabitEthernet7/0/0 1
set interface l2 bridge GigabitEthernet8/0/0 1
set interface state GigabitEthernet7/0/0 up
set interface state GigabitEthernet8/0/0 up
set interface state loop1 up
```
Particularly the last stanza, creating a bridge-domain, will remind Cisco operators
of the same semantics on the ASR9k and IOS/XR operating system. What it does is create
a bridge with two physical interfaces, and one so-called _bridge virtual interface_
which I expose to Linux as `bvi1`, with an IPv4 and IPv6 address. Beautiful!
### Configuring Bird
Now that VPP's interfaces are up, which I can validate with both `vppctl show int addr`
and as well `sudo ip netns exec dataplane ip addr`, I am ready to configure Bird and
put the router in the _default free zone_ (ie. run BGP on it):
```
cat << EOF > /etc/bird/bird.conf
router id 194.1.163.34;
protocol device { scan time 30; }
protocol direct { ipv4; ipv6; check link yes; }
protocol kernel kernel4 {
ipv4 { import none; export where source != RTS_DEVICE; };
learn off;
scan time 300;
}
protocol kernel kernel6 {
ipv6 { import none; export where source != RTS_DEVICE; };
learn off;
scan time 300;
}
include "static.conf";
include "core/ospf.conf";
include "core/ibgp.conf";
EOF
```
The most important thing to note in the configuration is that Bird tends to add a route
for each connected interface, while Linux has already added those. Therefore, I filter
out routes with source `RTS_DEVICE` (which means "connected routes"), but otherwise offer all
routes to the kernel, which in turn propagates these as Netlink messages that are
consumed by VPP. A detailed discussion of Bird's configuration semantics is in my
[VPP Part 5]({% post_url 2021-09-02-vpp-5 %}) post.
### Configuring SSH
While Ubuntu (or Debian) will start an SSH daemon upon startup, it does so in the
default namespace. However, our interfaces (like `loop0` or `xe1-2.100` above) are configured
to be present in the `dataplane` namespace. Therefore, I'll add a second SSH
daemon that runs specifically in the alternate namespace, like so:
```
cat << EOF | sudo tee /usr/lib/systemd/system/ssh-dataplane.service
[Unit]
Description=OpenBSD Secure Shell server (Dataplane Namespace)
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target auditd.service
ConditionPathExists=!/etc/ssh/sshd_not_to_be_run
Requires=netns-dataplane.service
After=netns-dataplane.service
[Service]
EnvironmentFile=-/etc/default/ssh
ExecStartPre=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t
ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS
ExecReload=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t
ExecReload=/usr/sbin/ip netns exec dataplane /bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartPreventExitStatus=255
Type=notify
RuntimeDirectory=sshd
RuntimeDirectoryMode=0755
[Install]
WantedBy=multi-user.target
Alias=sshd-dataplane.service
EOF
sudo systemctl enable ssh-dataplane
sudo systemctl start ssh-dataplane
```
And with that, our loopback address, and indeed any other interface created in the
`dataplane` namespace, will accept SSH connections. Yaay!
### Configuring SNMPd
At IPng Networks, we use [LibreNMS](https://librenms.org/) to monitor our machines and
routers in production. Similar to SSH, I want snmpd (which we disabled all the way
at the top of this article) to be exposed in the `dataplane` namespace. However, that
namespace will have interfaces like `xe0-0` or `loop0` or `bvi1` configured, and it's
important to note that Linux will only see those packets that were _punted_ by VPP, that
is to say, those packets which were destined to any IP address configured on the control
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({% post_url 2021-09-10-vpp-6 %}) about the SNMP Agent. All of that code
was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.
Then, for the SNMP daemon configuration file, assuming `net-snmp` (the default for Ubuntu and
Debian) which was installed in the very first step above, I use the following simple
configuration:
```
cat << EOF | sudo tee /etc/snmp/snmpd.conf
com2sec readonly default public
com2sec6 readonly default public
group MyROGroup v2c readonly
view all included .1 80
# Don't serve ipRouteTable and ipCidrRouteEntry (they're huge)
view all excluded .1.3.6.1.2.1.4.21
view all excluded .1.3.6.1.2.1.4.24
access MyROGroup "" any noauth exact all none none
sysLocation Rue des Saules, 59262 Sainghin en Melantois, France
sysContact noc@ipng.ch
master agentx
agentXSocket tcp:localhost:705,unix:/var/agentx/master
agentaddress udp:161,udp6:161
# OS Distribution Detection
extend distro /usr/bin/distro
# Hardware Detection
extend manufacturer '/bin/cat /sys/devices/virtual/dmi/id/sys_vendor'
extend hardware '/bin/cat /sys/devices/virtual/dmi/id/product_name'
extend serial '/bin/cat /var/run/snmpd.serial'
EOF
```
This config assumes that `/var/run/snmpd.serial` exists as a regular file rather than a `/sys`
entry. That's because while the `sys_vendor` and `product_name` fields are easily retrievable
by a regular user from the `/sys` filesystem, for some reason `board_serial` and `product_serial` are
only readable by root, and our SNMPd runs as user `Debian-snmp`. So, I'll just generate this at
boot-time in `/etc/rc.local`, like so:
```
cat << EOF | sudo tee /etc/rc.local
#!/bin/sh
# Assemble serial number for snmpd
BS=\$(cat /sys/devices/virtual/dmi/id/board_serial)
PS=\$(cat /sys/devices/virtual/dmi/id/product_serial)
echo \$BS.\$PS > /var/run/snmpd.serial
[ -x /etc/rc.firewall ] && /etc/rc.firewall
EOF
sudo chmod 755 /etc/rc.local
sudo /etc/rc.local
sudo systemctl restart snmpd-dataplane
```
## Results
With all of this, I'm ready to pick up the machine in LibreNMS, which looks a bit like
this:
{{< image width="800px" src="/assets/vpp/librenms.png" alt="LibreNMS" >}}
Or a specific traffic pattern looking at interfaces:
{{< image width="800px" src="/assets/vpp/librenms-frggh0.png" alt="LibreNMS" >}}
Clearly, looking at 17 days of ~18Gbit of traffic going through this particular router,
with zero crashes and zero SNMPd / agent restarts, this thing is a winner:
```
pim@frggh0:/etc/bird$ date
Tue 21 Sep 2021 01:26:49 AM UTC
pim@frggh0:/etc/bird$ ps auxw | grep vpp
root 1294 307 0.1 154273928 44972 ? Rsl Sep04 73578:50 /usr/bin/vpp -c /etc/vpp/startup.conf
Debian-+ 331639 0.2 0.0 21216 11812 ? Ss Sep04 22:23 /usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux mteTrigger mteTriggerConf -f -p /run/snmpd-dataplane.pid
Debian-+ 507638 0.0 0.0 2900 592 ? Ss Sep04 0:00 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30
Debian-+ 507659 1.6 0.1 1317772 43508 ? Sl Sep04 2:16 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30
pim 510503 0.0 0.0 6432 736 pts/0 S+ 01:25 0:00 grep --color=auto vpp
```
Thanks for reading this far :-)
---
date: "2021-10-24T08:49:09Z"
title: IPng acquires AS8298
---
# A Brief History
In January of 2003, my buddy Jeroen announced a project called the [Ghost Route Hunters](/assets/as8298/RIPE44-IPv6-GRH.pdf), after the industry had been plagued for a few years with anomalies in the DFZ - routes would show up with phantom BGP paths, unable to be traced down to a source or faulty implementation. Jeroen presented his [findings](/assets/as8298/RIPE46-IPv6-Routing-Table-Anomalies.pdf) at RIPE-46 and for years after this, the industry used the [SixXS GRH](https://www.sixxs.net/tools/grh/how/) as a distributed looking glass. At the time, one of SixXS's point of presence providers kindly lent the project AS8298 to build this looking glass and underlying infrastructure.
After running SixXS for 16 years, Jeroen and I decided to [Sunset]({%post_url 2017-03-01-sixxs-sunset %}) it, which meant that in June of 2017, the Ghost Route Hunter project came to an end as well, and as we tore down the infrastructure, AS8298 became dormant.
Then in August of 2021, I was doing a little bit of cleaning on the IPng Networks serving infrastructure, and came across some old mail from RIPE NCC about that AS number. And while IPng Networks is running [just fine]({%post_url 2021-02-27-network %}) on AS50869 today, it would be just that little bit cooler if it were to run on AS8298. So, I embarked on a journey to move a running ISP into a new AS number, which sounds like fun! This post describes the situation going into this renumbering project, and there will be another post, likely in January 2022, that describes the retrospective (this future post may be either celebratory, or a huge postmortem, to be determined).
## The Plan
#### Step 0. Acquire the AS
First off, I had to actually acquire the AS number. Back in the day (I'm speaking of 2002), RIPE NCC was a little bit less formal than it is today. As such, our loaned AS number was simply _registered_ to SixXS, which is not a legal entity. Later, to do things properly, it was placed in Jeroen's custody (by creating ORG-SIXX1-RIPE). But precisely because neither Jeroen nor SixXS was an LIR at that time, the previous holder (Easynet, with LIR `de.easynet`, later acquired by Sky, also called British Sky Broadcasting, trading under LIR `uk.bskyb`) became the sponsoring LIR. So I had to arrange two things:
1. A Transfer Agreement between Jeroen and myself; that signaled his willingness to transfer the AS number to me. This is [boilerplate](https://www.ripe.net/manage-ips-and-asns/resource-transfers-and-mergers/transfer-of-ip-addresses-and-as-numbers/transfer-agreement-template) stuff, and example contracts can be downloaded on the RIPE website.
1. An agreement between Sky and IPng Networks; that signaled the transfer from sponsoring LIR to our LIR `ch.ipng`. This is rather non-bureaucratic and a well traveled path; sponsoring LIRs and movements of resource between holders happen all the time.
With the permission of the previous holder, and with the help of the previous sponsoring LIR, the transfer itself was a matter of filing the correct paperwork at RIPE NCC, quoting the transfer agreement, and providing identification for the offering party (Jeroen) and the receiving party (Pim). And within a matter of a few days, the AS number was transferred to `ORG-PVP9-RIPE`, the `ch.ipng` LIR.
**NOTE** - In case you're wondering, I registered the `ch.ipng` LIR a few months *before* I incorporated IPng Networks as a swiss limited liability company (called a `GmbH`, or *Gesellschaft mit beschr&auml;nkter Haftung* in German), so for now I'm still trading as my natural person. RIPE has a cooldown period of two years before new LIRs can acquire/merge/rename. I do expect that some time in 2023 the PeeringDB page, bgp.he.net and friends will drop my personal name and show my company name :) For now, trading as Pim will have to do: slightly more business risk, but just as much fun!
#### Step 1. Split AS50869 into two networks
{{< image width="300px" float="right" src="/assets/network/zurich-ring.png" alt="Zurich Metro" >}}
The autonomous system of IPng Networks spans two main parts. Firstly, in Zurich IPng Networks operates four sites and six routers:
* Two in a private colocation site at Daedalean (**A**) in Albisrieden called `ddln0` and `ddln1`, they are running [DANOS](https://danosproject.org/)
* Two at our offices in Br&uuml;ttisellen (**C**), called `chbtl0` and `chbtl1`, they are running [Debian](https://debian.org/)
* One at Interxion ZUR1 datacenter in Glattbrugg (**D**), called `chgtg0`, running [VPP]({%post_url 2021-09-21-vpp-7 %}), connecting to a public internet exchange CHIX-CH and taking transit from IP-Max and Openfactory.
* One at NTT's datacenter in R&uuml;mlang (**E**), called `chrma0`, also running [VPP]({%post_url 2021-09-21-vpp-7 %}), connecting to a public internet exchange SwissIX and taking transit from IP-Max and Meerfarbig.
NOTE: You can read a lot about my work on VPP in a series of [VPP articles](/s/articles/), please take a look!
There are a few downstream IP Transit networks and lots of locally connected networks, such as in the DDLN colo. Then, from `chrma0`, we connect to our european ring northwards (towards Frankfurt), and from `chgtg0` we connect to our european ring south-westwards (towards Geneva).
{{< image width="300px" float="right" src="/assets/network/european-ring.png" alt="European Ring" >}}
That ring, then, consists of five additional sites and five routers, all running [VPP]({%post_url 2021-09-21-vpp-7 %}):
* Frankfurt: `defra0`, connecting to four DE-CIX exchange points in Frankfurt itself directly, and remotely to Munich, D&uuml;sseldorf and Hamburg
* Amsterdam: `nlams0`, connecting to NL-IX, SpeedIX, FrysIX (our favorite!), and LSIX; we also pick up two transit providers (A2B and Coloclue).
* Lille: `frggh0`, connecting to the northern France exchange called LillIX
* Paris: `frpar0`, connecting to two FranceIX exchange points, directly in Paris, and remotely to Marseille
* Geneva: `chplo0`, connecting to our very own Free IX
Every one of these sites has an upstream session with AS25091 (IP-Max). Considering these folks are organizationally very close to me, it is easy for me to rejigger any one of those sessions between AS50869 (current) and AS8298 (new). And considering AS8298 has been a member of our as-set `AS-IPNG` for a while, it'll also be natural to rely on IP-Max for propagation, even if some peering sessions might be down.
So before I start, IPng Networks' view from AS25091 in Amsterdam looks like this:
```
Network Next Hop Metric LocPrf Weight Path
* 92.119.38.0/24 46.20.242.210 0 50869 i
* 176.119.215.0/24 46.20.242.210 0 50869 60557 i
* 185.36.229.0/24 46.20.242.210 0 50869 212855 i
* 185.173.128.0/24 46.20.242.210 0 50869 57777 i
* 185.209.12.0/24 46.20.242.210 0 50869 212323 i
* 192.31.196.0/24 46.20.242.210 0 50869 112 i
* 192.175.48.0/24 46.20.242.210 0 50869 112 i
* 194.1.163.0/24 46.20.242.210 0 50869 i
* 194.126.235.0/24 46.20.242.210 0 50869 i
Network Next Hop Metric LocPrf Weight Path
* 2001:4:112::/48 2a02:2528:1902::210 0 50869 112 i
* 2001:678:3d4::/48 2a02:2528:1902::210 0 50869 201723 i
* 2001:678:ce4::/48 2a02:2528:1902::210 0 50869 60557 i
* 2001:678:ce8::/48 2a02:2528:1902::210 0 50869 60557 i
* 2001:678:cec::/48 2a02:2528:1902::210 0 50869 60557 i
* 2001:678:cf0::/48 2a02:2528:1902::210 0 50869 60557 i
* 2001:678:d78::/48 2a02:2528:1902::210 0 50869 i
* 2001:67c:6bc::/48 2a02:2528:1902::210 0 50869 201723 i
* 2620:4f:8000::/48 2a02:2528:1902::210 0 50869 112 i
* 2a07:cd40::/29 2a02:2528:1902::210 0 50869 212855 i
* 2a0b:dd80::/29 2a02:2528:1902::210 0 50869 i
* 2a0b:dd80::/32 2a02:2528:1902::210 0 50869 i
* 2a0d:8d06::/32 2a02:2528:1902::210 0 50869 60557 i
* 2a0e:fd40:200::/48 2a02:2528:1902::210 0 50869 60557 i
* 2a0e:fd45:da0::/48 2a02:2528:1902::210 0 50869 60557 i
* 2a10:d200::/29 2a02:2528:1902::210 0 50869 212323 i
* 2a10:fc40::/29 2a02:2528:1902::210 0 50869 57777 i
```
#### Step 2. Restrict the routers that originate our prefixes
As a preparation to actually starting to use AS8298, I'll create RPKI records of authorization for all of our prefixes in **both** AS50869 and AS8298, and I'll add `route:` objects for all of them in both as well.
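For illustration, such a pair of `route:` objects might look like this, using one of our own prefixes from above; the `mnt-by:` value is a placeholder, not our actual maintainer object:
```
route:   194.1.163.0/24
origin:  AS8298
mnt-by:  EXAMPLE-MNT
source:  RIPE

route:   194.1.163.0/24
origin:  AS50869
mnt-by:  EXAMPLE-MNT
source:  RIPE
```
With both objects (and matching ROAs for both origins) in place, the prefix validates as RPKI-valid regardless of which AS announces it during the transition.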
Now, I'm ready to make my first networking topology change: instead of originating our prefixes in _all_ routers, I will originate our prefixes in AS50869 only from _two_ routers: `chbtl0` and `chbtl1`. Nobody on the internet will notice this change, the as-path will remain `^50869$` for all prefixes.
I'll also prepare the two routers to speak to each other with an iBGP session (rather than to IPng Networks' route-reflectors, which are still in AS50869).
#### Step 3. Convert Br&uuml;ttisellen to AS8298
Now, I'll switch these routers `chbtl0` and `chbtl1` out of AS50869 and into AS8298. The only damage at this point might be that my personal Spotify and Netflix stop working, and my family yells at me (but they do that all the time anyway, so it's a wash...). If things go poorly, the backout plan is to switch back to AS50869 and return things to normal. But if things go well, from this point onwards everybody will see our own IPng Networks prefixes via as-path `^50869_8298$` and effectively, AS50869 will become a transit provider for AS8298, which will be singlehomed for the moment.
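The as-path patterns used throughout this plan (like `^50869$` and `^50869_8298$`) are BGP regular expressions, in which `_` matches the boundary between two AS numbers. A rough Python sketch of their semantics (illustrative only; real BGP implementations are more subtle):

```python
import re

def aspath_match(pattern: str, aspath: str) -> bool:
    """Match a BGP-style as-path regex against a space-separated as-path.
    In BGP regexes, '_' matches the boundary between two AS numbers."""
    pyre = pattern.replace('_', r'\s+').lstrip('^').rstrip('$')
    return re.fullmatch(pyre, aspath) is not None

print(aspath_match('^50869$', '50869'))            # True: before the move
print(aspath_match('^50869_8298$', '50869 8298'))  # True: AS50869 as transit for AS8298
print(aspath_match('^8298$', '50869 8298'))        # False: AS8298 not yet directly visible
```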
My buddy Max runs a small /29 and /64 exchange point in Br&uuml;ttisellen with only three members on it - IPng Networks, Stucchinet and Openfactory. I will ask both of them to be my canary, and change their peering session from AS50869 to AS8298. If things go badly, that's no worries, I can drop/disable these peering sessions as I have sessions with both in other places as well. But it'd be a good place to see and test whether things work as expected.
#### Step 4. Convert DDLN to AS8298
Now that our IPv4 and IPv6 prefixes have moved and AS50869 does not originate prefixes anymore, you may wonder 'what about the colo?'. Indeed, the colo runs at Daedalean behind `ddln0` and `ddln1`, both of which are still in AS50869. However, the only way to ever be able to reach those prefixes, would be to find an entrypoint in AS50869 (as it is the only uplink of AS8298). All routers in AS50869 and AS8298 share an underlying OSPF and OSPFv3 interior gateway protocol, which means that if anything is destined to `194.1.163.0/24` or `2001:678:d78::/48`, those packets will find their way to the correct location using the IGP. That's neat, because it means that even though `ddln0` is speaking BGP in AS50869, it will happily forward traffic to its more specific prefixes from AS8298.
Considering there's only one IP transit network in DDLN, those two routers will be the first for me to convert. After converting them to AS8298, they will receive transit from AS50869 just like the ones in Br&uuml;ttisellen. I'll rejigger Jeroen's AS57777 to receive transit directly from AS8298, and it will then be the first to transition. Jeroen's prefixes will briefly become `^50869_8298_57777$`, which will be the only change, but this will validate that, indeed, AS8298 can provide transit. Apart from the longer as-path, the physical path that IP packets take will remain the same because Jeroen's network is currently *cough* singlehomed at DDLN.
Now that I have four routers in AS8298, I'll add iBGP sessions directly between the pairs only, rather than a full mesh across these routers. I've now created _two islands_ of AS8298, interconnected by AS50869. You'd think this is bad, but it's actually a fine situation to be in and there will be no loss of service, because:
* we've already established that AS50869 has reachability to all more-specifics in AS8298 (Step 3)
* AS50869 has reachability to AS57777 via its downstream AS8298
* AS50869 is the only way in and out of the DDLN or Br&uuml;ttisellen routers
Nevertheless, I'd rather swiftly continue with the next step and reconnect these two islands. It is a good time for me to give a headsup to the larger internet exchanges (notably SwissIX) so folks can prepare for what's coming. I think many NOC teams know how to establish/remove a peering session, but I do expect the ITIL-automated teams to not have a playbook for "peer on IP 192.0.2.1 has changed from AS666 to AS42". I'll observe their performance on this task and take notes, as there's quite a few public IXPs to go...
#### Step 5. Convert R&uuml;mlang and Glattbrugg to AS8298
After four of our routers (two redundant pairs) have been operating in AS8298 for a few days, I'm ready to renumber our first machine connected to a public internet exchange: `chgtg0` is connected to [CHIX-CH](https://chix.ch/). I'll contact the members with whom IPng Networks has a direct peering, and of course the internet exchange folks, to ask them to renumber our AS50869 to AS8298. After restarting our router into the new AS, one by one I'll establish sessions with our peers there - this is an important exercise because I'll be doing it again in Step 6 on every peering/border router in the european ring, and there will be in total ~1'850 BGP adjacencies that have to be renumbered.
At CHIX-CH, IPng has in total four BGP sessions with the two routeservers in AS212100, and in total 38 direct BGP sessions; two of which are somewhat important: AS13030 (Init7) and AS15169 (Google), to which a large fraction of our traffic flows. While upgrading this router, I'll also switch my one downstream network (Daedalean itself, operating AS212323) to receive transit from AS8298. Because I've canaried this with Jeroen's AS57777 in the DDLN colo previously, I'll be reasonably certain at this point that it'll work well. If not, they have alternative uplinks (notably AS174), so they should be fine without me.
At SwissIX, IPng likewise has four BGP sessions with the two routeservers in AS42476, and in total 132 direct BGP sessions. I think that, once these two peering routers are complete, I'll checkpoint and let things run like this for a while. Let's take a few weeks off, giving me a while to hunt down peers at SwissIX and CHIX-CH to catch up and re-establish their sessions with the new AS8298 :)
After this step, phase one of the transition is complete, and AS8298 (and its networks AS57777 and AS212323) will be directly visible in Switzerland, and still tucked away behind `^50869_8298_212323$` for the international traffic. I will however have 2 transit sessions from IP-Max (AS25091), 2 transit sessions from Openfactory (AS58299) and 1 transit session from Meerfarbig (AS34549); so it is expected to be a quite stable network at this point, which is good.
#### Step 6. Convert European Ring to AS8298
For this task, I'll swap my iBGP sessions to all use the new AS8298. I do this by first dismantling `rr0.chbtl0.ipng.ch` and bring it back up as an iBGP speaker in the new AS; then one by one push all routers to speak to that route reflector (in addition to the existing route reflectors in AS50869). After that stabilizes, rinse and repeat with `rr0.chplo0.ipng.ch`; and finally finish the job with `rr0.frggh0.ipng.ch`. The two pairs of routers who were by themselves in AS8298 (`chbtl0`/`chbtl1` in Br&uuml;ttisellen; and `ddln0`/`ddln1` at Daedalean) can now be reattached to the iBGP route-reflectors as well.
It will be a bit of a schlepp, but now comes the notification of all international peers (there are an additional ~250 direct peerings) and downstreams (there are three left: Raymon, Jelle and Krik), and upstreams (there are two additional ones: Coloclue, and A2B). While normally this is a matter of merely swapping the AS number, it has to be done on both sides - on my side, I can do it with a one-line change to the git repository, and it'll be pushed by Kees (the network build and push automation that was inspired by Coloclue's [Kees](https://github.com/coloclue/kees/)), on the remote side it will be a matter of patience. One by one, folks will (or.. won't) update their peering session. The only folks I'll actively chase is the DE-CIX, FranceIX, NL-IX, LSIX and FrysIX routeserver operators, as the vast majority of adjacencies are learned via those. By means of a headsup one week in advance, and a few reminders on the day of, and the day after maintenance, I should minimize downtime. But, in this case, because I already have two transit providers in Switzerland (AS25091 IP-Max, and AS58299 Openfactory) who provide me full tables in AS8298, it should be operationally smooth sailing.
At the end of this exercise, the as-path will be `^8298$` for my own prefixes and `^8298_.*$` for my downstream networks, and AS50869 will no longer be in any as-path in the DFZ. Rolling back is tricky: although Bird can do individual peering sessions with differing AS numbers, I don't think that is a good idea, as it would mean many (many) changes to the repository. So, in the interest of simplicity and not breaking things that work, I will do them router-by-router rather than session-by-session, and send a few reminders to folks to update their records to match my PeeringDB entries.
#### Step 7. Repurpose/retire AS50869
There's really not that much to do -- delete the `route:` and `route6:` objects and remove RPKI ROAs for the old announcements; but mostly it'll be a matter of hunting down peering partners who have not (yet) updated their records and sessions. I imagine lots of folks will hesitate and be unfamiliar with this type of operation (even though it literally is an `s/50869/8298/g` for them). I'll take most of December to remind folks a few times, and ultimately just clean up broken peering sessions in January 2022.
And of course, then lick my wounds and count pokemon - on October 24th, the day this post was published, Hurricane Electric showed 1'845 adjacencies in total for AS50869, of which 1'653 IPv4 and 1'430 IPv6. I will consider it a success if I lose less than 200 adjacencies. I'll keep AS50869 around, as a test AS number to do a few experiments.
## The Timeline
Most intrusive maintenance will be done in maintenance windows between 22:00 and 03:00 UTC, Thursday night into Friday. The tentative planning for the project, which starts on October 22nd and lasts through the end of the year (10 weeks):
1. 2021-10-22 - RIPE updates for `route:`, `route6:` and RPKI ROAs
1. 2021-10-24 - Originate prefixes from chbtl0/chbtl1 (no-op for the DFZ)
1. 2021-10-28 - Move chbtl0/chbtl1 at Br&uuml;ttisellen to AS8298
1. 2021-10-29 - Update Br&uuml;ttisellen IXP (AS58299 and AS58280), canary upstream AS58299
1. 2021-10-29 - Headsup to CHIX-CH and SwissIX peers and Transit partners, announcing move
1. 2021-11-04 - Move ddln0/ddln1 at Daedalean to AS8298, canary downstream AS57777
1. 2021-11-11 - Move chgtg0 Glattbrugg to AS8298, add downstream AS212323
1. 2021-11-12 - Move CHIX-CH peers to AS8298
1. 2021-11-15 - Move chrma0 R&uuml;mlang to AS8298
1. 2021-11-16 - Move SwissIX peers to AS8298
1. Two week cooldown period, start to move IXPs to AS8298
1. 2021-11-29 - Headsup to all european IXPs and IP Transit partners, announcing move
1. 2021-12-02 - Move defra0, nlams0, frggh0, frpar0, chplo0 to AS8298
1. 2021-12 - Move european IXPs to AS8298
1. 2021-12-06 - First reminder for peerings that did not re-establish
1. 2021-12-13 - Second reminder for peerings that did not re-establish
1. 2021-12-20 - Third (final) reminder for peerings that did not re-establish
1. 2022-01-10 - Remove peerings that did not re-establish
## Appendix
This blogpost turned into a talk at [RIPE #83](/assets/as8298/RIPE83-IPngNetworks.pdf), in case you wanted to take a look at [the recording](https://ripe83.ripe.net/archives/video/649/) and [a few questions](https://ripe83.ripe.net/archives/video/650/).
---
date: "2021-11-14T22:49:09Z"
title: Case Study - BGP Routing Policy
---
# Introduction
BGP Routing policy is a very interesting topic. I get asked about it formally
and informally all the time. I have to admit, there are lots of ways to organize
an automous system. Vendors have unique features and templating / procedural
functions, but in the end, BGP routing policy all boils down to two+two things:
1. Not accepting the prefixes you don't want (inbound)
* For those prefixes accepted, ensure they have correct attributes.
1. Not announcing prefixes to folks who shouldn't see them (outbound)
* For those prefixes announced, ensure they have correct attributes.
At IPng Networks, I've cycled through a few iterations and landed on a specific
setup that works well for me. It provides sufficient information to enable our
downstream (customers) to make good decisions on what they should accept from
us, as well as enough expressivity for them to determine which prefixes we
should propagate for them, where, and how.
This article describes one approach to a relatively feature-rich routing policy
which is in use at IPng Networks (AS8298). It uses the [Bird2](https://bird.network.cz/) configuration
language, although the concepts would be implementable in ~any modern routing
suite (e.g. FRR, Cisco, Juniper, Arista, Extreme, et cetera).
Interested in one operator's opinion? Read on!
## 1. Concepts
There are three basic pieces of routing filtering, which I'll describe briefly.
### Prefix Lists
A prefix list (also sometimes referred to as an access-list in older software)
is a list of IPv4 or IPv6 prefixes, often with a prefixlen boundary, that
determines if a given prefix is "in" or "out".
An example could be: `2001:db8::/32{32,48}` which describes any prefix in the
supernet `2001:db8::/32` that has a prefix length of anywhere between /32 and
/48, inclusive.
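As a hedged sketch of how such a set is used in Bird2 (the filter name here is hypothetical), a prefix list can be matched directly against `net` in a filter:

```
# Hypothetical Bird2 filter: accept only prefixes that fall within the
# documentation supernet 2001:db8::/32, with lengths from /32 to /48.
filter doc_only {
  if net ~ [ 2001:db8::/32{32,48} ] then accept;
  reject;
}
```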
### AS Paths
In BGP, each prefix learned comes with an AS path on how to reach it. If my router
learns a prefix from a peer with AS number `65520`, it'll see every prefix that peer
sends as a list of AS numbers starting with 65520. With AS Paths, the very first
one in the list is the one the router directly learned the prefix from, and the very
last one is the origin of the prefix. Oftentimes the AS path is written as a regular
expression, starting with `^` and ending with `$`, and to help readability,
spaces are often written as `_`.
Examples: `^25091_1299_3301$` and `^58299_174_1299_3301$`
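In Bird2, as-paths can also be matched with path masks rather than regular expressions; a small hypothetical fragment, reusing the AS numbers from the examples above:

```
# Hypothetical Bird2 filter fragment. In a path mask, [= ... =] matches
# the entire path and * matches any (possibly empty) sequence of ASNs.
if bgp_path ~ [= 25091 * =] then print "learned directly from AS25091";
if bgp_path ~ [= * 3301 =] then print "originated by AS3301";
```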
### BGP Communities
When learning (or originating) a prefix in BGP, zero or more so-called `communities`
can be added to it along the way. The _Routing Information Base_ or _RIB_ carries
these communities and can share them between peering sessions. Communities can be
added, removed and modified. Some communities have special meaning (which is
agreed upon by everyone), and some have local meaning (agreed upon by only
one or a small set of operators).
There are three types of communities: _normal_ communities are a pair of 16-bit
integers; _extended_ communities are 8 bytes, split into one 16-bit integer
and an additional 48-bit value; and finally _large_ communities consist of
a triplet of 32-bit values.
Examples: `(8298, 1234)` (normal), or `(8298, 3, 212323)` (large)
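A quick hedged sketch of what manipulating these looks like in Bird2, reusing the values from the examples above:

```
# Hypothetical Bird2 filter fragment: communities live in the attributes
# bgp_community (normal) and bgp_large_community (large).
bgp_community.add((8298, 1234));             # normal: two 16-bit values
bgp_large_community.add((8298, 3, 212323)); # large: three 32-bit values
if (8298, 1234) ~ bgp_community then bgp_community.delete((8298, 1234));
```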
# Routing Policy
Now that I've explained a little bit about the ingredients we have to work with,
let me share an observation that took me a few decades to make: BGP sessions are
really all the same. As such, every single one of the BGP sessions at IPng Networks
is generated from one template. What makes the difference between 'Transit', 'Customer',
'Peer' and 'Private Interconnect' really all boils down to what types of filtering
are applied on in- and outbound updates. I will demonstrate this by means of two main
functions in Bird: `ebgp_import()`, discussed first in the section ***Inbound: Learning
Routes***, and `ebgp_export()`, in the section ***Outbound: Announcing Routes***.
## 2. Inbound: Learning Routes
Let's consider this function:
```
function ebgp_import(int remote_as) {
if aspath_bogon() then return false;
if (net.type = NET_IP4 && ipv4_bogon()) then return false;
if (net.type = NET_IP6 && ipv6_bogon()) then return false;
if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
# Demote certain AS nexthops to lower pref
if (bgp_path.first ~ AS_LOCALPREF50 && bgp_path.len > 1) then bgp_local_pref = 50;
if (bgp_path.first ~ AS_LOCALPREF30 && bgp_path.len > 1) then bgp_local_pref = 30;
if (bgp_path.first ~ AS_LOCALPREF10 && bgp_path.len > 1) then bgp_local_pref = 10;
# Graceful Shutdown (RFC8326)
if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
# Scrub BLACKHOLE community
bgp_community.delete((65535, 666));
return true;
}
```
The function works by order of elimination -- for each prefix that is offered on the
session, it will either be rejected (by means of returning `false`), or modified (by means
of setting attributes like `bgp_local_pref`) and then accepted (by means of returning
`true`).
***AS-Path Bogon*** filtering is a way to remove prefixes that have an invalid AS
number in their path. The main examples of this are reserved and private AS numbers (64496-131071)
and their 32-bit equivalents (4200000000-4294967295). In case you haven't come across
this yet, AS number 23456 is also magic, see [RFC4893](https://datatracker.ietf.org/doc/html/rfc4893)
for details:
```
function aspath_bogon() {
return bgp_path ~ [0, 23456, 64496..131071, 4200000000..4294967295];
}
```
***Prefix Bogon*** filtering comes next, as certain prefixes are not publicly routable (you
know, such as [RFC1918](https://datatracker.ietf.org/doc/html/rfc1918) space, but there are
many others). The lists differ for IPv4 and IPv6:
```
function ipv4_bogon() {
return net ~ [
0.0.0.0/0, # Default
0.0.0.0/32-, # RFC 5735 Special Use IPv4 Addresses
0.0.0.0/0{0,7}, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3
10.0.0.0/8+, # RFC 1918 Address Allocation for Private Internets
100.64.0.0/10+, # RFC 6598 IANA-Reserved IPv4 Prefix for Shared Address Space
127.0.0.0/8+, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3
169.254.0.0/16+, # RFC 3927 Dynamic Configuration of IPv4 Link-Local Addresses
172.16.0.0/12+, # RFC 1918 Address Allocation for Private Internets
192.0.0.0/24+, # RFC 6890 Special-Purpose Address Registries
192.0.2.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation
192.168.0.0/16+, # RFC 1918 Address Allocation for Private Internets
198.18.0.0/15+, # RFC 2544 Benchmarking Methodology for Network Interconnect Devices
198.51.100.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation
203.0.113.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation
224.0.0.0/4+, # RFC 1112 Host Extensions for IP Multicasting
240.0.0.0/4+ # RFC 6890 Special-Purpose Address Registries
];
}
function ipv6_bogon() {
return net ~ [
::/0, # Default
::/96, # IPv4-compatible IPv6 address - deprecated by RFC4291
::/128, # Unspecified address
::1/128, # Local host loopback address
::ffff:0.0.0.0/96+, # IPv4-mapped addresses
::224.0.0.0/100+, # Compatible address (IPv4 format)
::127.0.0.0/104+, # Compatible address (IPv4 format)
::0.0.0.0/104+, # Compatible address (IPv4 format)
::255.0.0.0/104+, # Compatible address (IPv4 format)
0000::/8+, # Pool used for unspecified, loopback and embedded IPv4 addresses
0100::/8+, # RFC 6666 - reserved for Discard-Only Address Block
0200::/7+, # OSI NSAP-mapped prefix set (RFC4548) - deprecated by RFC4048
0400::/6+, # RFC 4291 - Reserved by IETF
0800::/5+, # RFC 4291 - Reserved by IETF
1000::/4+, # RFC 4291 - Reserved by IETF
2001:10::/28+, # RFC 4843 - Deprecated (previously ORCHID)
2001:20::/28+, # RFC 7343 - ORCHIDv2
2001:db8::/32+, # Reserved by IANA for special purposes and documentation
2002:e000::/20+, # Invalid 6to4 packets (IPv4 multicast)
2002:7f00::/24+, # Invalid 6to4 packets (IPv4 loopback)
2002:0000::/24+, # Invalid 6to4 packets (IPv4 default)
2002:ff00::/24+, # Invalid 6to4 packets
2002:0a00::/24+, # Invalid 6to4 packets (IPv4 private 10.0.0.0/8 network)
2002:ac10::/28+, # Invalid 6to4 packets (IPv4 private 172.16.0.0/12 network)
2002:c0a8::/32+, # Invalid 6to4 packets (IPv4 private 192.168.0.0/16 network)
3ffe::/16+, # Former 6bone, now decommissioned
4000::/3+, # RFC 4291 - Reserved by IETF
5f00::/8+, # RFC 5156 - used for the 6bone but was returned
6000::/3+, # RFC 4291 - Reserved by IETF
8000::/3+, # RFC 4291 - Reserved by IETF
a000::/3+, # RFC 4291 - Reserved by IETF
c000::/3+, # RFC 4291 - Reserved by IETF
e000::/4+, # RFC 4291 - Reserved by IETF
f000::/5+, # RFC 4291 - Reserved by IETF
f800::/6+, # RFC 4291 - Reserved by IETF
fc00::/7+, # Unicast Unique Local Addresses (ULA) - RFC 4193
fe80::/10+, # Link-local Unicast
fec0::/10+, # Site-local Unicast - deprecated by RFC 3879 (replaced by ULA)
ff00::/8+ # Multicast
];
}
```
That's a long list!! But operators on the _DFZ_ should really never be accepting any
of these, and we should all collectively yell at those who propagate them.
***RPKI Filtering*** is a fantastic routing security feature, described in [RFC6810](https://datatracker.ietf.org/doc/html/rfc6810)
and relatively straightforward to implement. For each _originating_ AS number, we can
check in a table of known `<origin,prefix>` mappings whether it is the correct origin
for the prefix. The lookup can either match (which makes the prefix RPKI valid),
fail because the prefix is missing (which makes the prefix RPKI unknown),
or specifically mismatch (which makes the prefix RPKI invalid). Operators are
encouraged to flag and drop _invalid_ prefixes:
```
function ipv4_rpki_invalid() {
return roa_check(t_roa4, net, bgp_path.last) = ROA_INVALID;
}
function ipv6_rpki_invalid() {
return roa_check(t_roa6, net, bgp_path.last) = ROA_INVALID;
}
```
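For completeness, the `t_roa4` and `t_roa6` tables referenced here have to be defined and filled somewhere; a minimal sketch using Bird2's `rpki` protocol, assuming a local RPKI-to-router (RTR) cache at a hypothetical hostname:

```
# Hypothetical Bird2 fragment: define the ROA tables and feed them from
# an RTR cache (the hostname is an assumption; 3323 is the RPKI-RTR port).
roa4 table t_roa4;
roa6 table t_roa6;

protocol rpki rtr1 {
  roa4 { table t_roa4; };
  roa6 { table t_roa6; };
  remote "rtr.example.net" port 3323;
  retry keep 90;
  refresh keep 900;
  expire keep 172800;
}
```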
***NOTE***: In NLNOG my post sparked a bit of debate on the use of `bgp_path.last_nonaggregated`
versus simply `bgp_path.last`. Job Snijders did some spelunking and offered [this post](https://bird.network.cz/pipermail/bird-users/2019-September/013805.html) and a reference to [RFC6907](https://datatracker.ietf.org/doc/html/rfc6907) for details, and
Tijn confirmed that Coloclue (on which many of my approaches have been modeled) indeed uses
`bgp_path.last`. I've updated my configs, with many thanks for the discussion.
Alright, now that I've determined the as-path and prefix are kosher, and that it
is not known to be hijacked (ie. is either `ROA_VALID` or `ROA_UNKNOWN`), I'm ready
to set a few attributes, notably:
* ***AS_LOCALPREF*** If the peer I learned this prefix from is in the given list, set
the BGP local preference to either 50, 30 or 10 respectively (a lower localpref means
the prefix is less likely to be selected). Some internet providers send lots of
prefixes, but have poor network connectivity to the place I learned the routes from
(a few examples: AS6939 is often oversubscribed in Amsterdam, AS39533 was
for a while connected via a tunnel (!) to Zurich, and several hobby/amateur IXPs run
on a VXLAN bridged domain rather than a physical switch).
* ***Graceful Shutdown*** described in [RFC8326](https://datatracker.ietf.org/doc/html/rfc8326),
shows a way to allow operators to pre-announce their downtime by setting a special
BGP community that informs their peers to deselect that path by setting the local
preference to the lowest possible value. This oneliner matching on `(65535,0)`
implements that behavior.
* ***Blackhole Community*** described in [RFC7999](https://datatracker.ietf.org/doc/html/rfc7999),
is another special BGP community of `(65535,666)` which signals the need to stop sending
traffic to the prefix at hand. I haven't yet implemented the blackhole routing (this has
to do with an intricacy of the VPP Linux-CP code that I wrote), so for now I'll just remove
the community.
Alright, based on this one template, I'm now ready to implement all three types of
BGP session: ***Peer***, ***Upstream***, and ***Downstream***.
### Peers
```
function ebgp_import_peer(int remote_as) {
# Scrub BGP Communities (RFC 7454 Section 11)
bgp_community.delete([(8298, *)]);
bgp_large_community.delete([(8298, *, *)]);
return ebgp_import(remote_as);
}
```
It's dangerous to accept communities for my own AS8298 from peers. This is because
several of them can actively change the behavior of route propagation (these types
of communities are commonly called _action_ communities). So with peering
relationships, I'll just toss them all.
Now, working my way up to the actual BGP peering session, taking for example a
peer that I'm connecting to at LSIX (the routeserver, in fact) in Amsterdam:
```
filter ebgp_lsix_49917_import {
if ! ebgp_import_peer(49917) then reject;
# Add IXP Communities
bgp_community.add((8298,1036));
bgp_large_community.add((8298,1,1036));
accept;
}
protocol bgp lsix_49917_ipv4_1 {
description "LSIX IX Route Servers (LSIX)";
local as 8298;
source address 185.1.32.74;
neighbor 185.1.32.254 as 49917;
default bgp_med 0;
default bgp_local_pref 200;
ipv4 {
import keep filtered;
import filter ebgp_lsix_49917_import;
export filter ebgp_lsix_49917_export;
receive limit 100000 action restart;
next hop self on;
};
};
```
Parsing this through: the ipv4 import filter is called `ebgp_lsix_49917_import` and its
job is to run the whole kittenkaboodle of filtering I described above, and then if the
`ebgp_import_peer()` function returns false, to simply drop the prefix. But if it is
accepted, I'll tag it with a few communities. As I'll show later, any other peer will receive
these communities if I decide to propagate the prefix to them. This is specifically
useful for downstreams (customers), who can decide to accept/deny the prefix based on a
well-known set of communities we tag.
***IXP Community***: If the prefix is learned at an IXP, I'll add a large community
`(8298,1,*)` and backwards compat normal community `(8298,10XX)`.
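To illustrate how a downstream might consume these informational communities, here's a hedged sketch of a customer-side import filter that deprefers anything AS8298 learned at an IXP (the filter name and localpref value are hypothetical):

```
# Hypothetical downstream-side Bird2 import filter: routes tagged with
# the large community (8298, 1, *) were learned by AS8298 at an IXP.
filter from_ipng {
  if bgp_large_community ~ [(8298, 1, *)] then bgp_local_pref = 90;
  accept;
}
```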
One last thing I'll note, and this is a matter of taste: most peering prefixes
picked up at internet exchanges (like LSIX) are typically much cheaper per megabit than
transit routes, so I will set a default `bgp_local_pref` of 200 (a higher localpref
is more likely to be selected as the active route).
### Upstream
An interesting observation: from Peers and from Upstreams I typically am happy to take
all the prefixes I can get (but see the epilog below for an important note on this). For a
Peer, this is mostly "their own prefixes" and for a Transit, this is mostly "all prefixes",
but there's things in the middle, say partial transit of "all prefixes learned at IXP A B
and C". Really, all inbound sessions are very similar:
```
function ebgp_import_upstream(int remote_as) {
# Scrub BGP Communities (RFC 7454 Section 11)
bgp_community.delete([(8298, *)]);
bgp_large_community.delete([(8298, *, *)]);
return ebgp_import(remote_as);
}
```
... is in fact identical to the `ebgp_import_peer()` function above, so I'll not discuss
it further. But for the sessions to upstream (==transit) providers, it can make sense
to use slightly different BGP community tags and a lower localpref:
```
filter ebgp_ipmax_25091_import {
if ! ebgp_import_upstream(25091) then reject;
# Add BGP Large Communities
bgp_large_community.add((8298,2,25091));
# Add BGP Communities
bgp_community.add((8298,2000));
accept;
}
protocol bgp ipmax_25091_ipv4_1 {
description "IP-Max Transit";
local as 8298;
source address 46.20.242.210;
neighbor 46.20.242.209 as 25091;
default bgp_med 0;
default bgp_local_pref 50;
ipv4 {
import keep filtered;
import filter ebgp_ipmax_25091_import;
export filter ebgp_ipmax_25091_export;
next hop self on;
};
};
```
Again, a very similar pattern; the only material difference is that the inbound prefixes
are tagged with an ***Upstream Community*** which is of the form `(8298,2,*)` and backwards
compatible `(8298,20XX)`. Downstream customers can use this, if they wish, to select or
reject routes (maybe they don't like routes coming from AS25091, although they should know
better because IP-Max rocks!).
The other slight change here is the `bgp_local_pref` is set to 50, which implies that it will
be used only if there are no alternatives in the _RIB_ with a higher localpref, or with a
similar localpref but shorter as-path, or many other scenarios which I won't get into here,
because BGP selection criteria 101 is a whole blogpost of its own.
### Downstream
That brings us to the third type of BGP sessions -- commonly referred to as customers except
that not everybody pays :) so I just call them _downstreams_:
```
function ebgp_import_downstream(int remote_as) {
# We do not scrub BGP Communities (RFC 7454 Section 11) for customers
return ebgp_import(remote_as);
}
```
Here, I have a special relationship with the `remote_as`, and I do not scrub the communities,
letting the downstream operator set whichever they like. As I'll demonstrate in the next
chapter, they can use these communities to drive certain types of behavior.
Here's how I use this `ebgp_import_downstream()` function in the full filter for a downstream:
```
# bgpq4 -Ab4 -R 24 -m 24 -l 'define AS201723_IPV4' AS201723
define AS201723_IPV4 = [
185.54.95.0/24
];
# bgpq4 -Ab6 -R 48 -m 48 -l 'define AS201723_IPV6' AS201723
define AS201723_IPV6 = [
2001:678:3d4::/48,
2001:67c:6bc::/48
];
filter ebgp_raymon_201723_import {
if (net.type = NET_IP4 && ! (net ~ AS201723_IPV4)) then reject;
if (net.type = NET_IP6 && ! (net ~ AS201723_IPV6)) then reject;
if ! ebgp_import_downstream(201723) then reject;
# Add BGP Large Communities
bgp_large_community.add((8298,3,201723));
# Add BGP Communities
bgp_community.add((8298,3500));
accept;
}
protocol bgp raymon_201723_ipv4_1 {
local as 8298;
source address 185.54.95.250;
neighbor 185.54.95.251 as 201723;
default bgp_med 0;
default bgp_local_pref 400;
ipv4 {
import keep filtered;
import filter ebgp_raymon_201723_import;
export filter ebgp_raymon_201723_export;
receive limit 94 action restart;
next hop self on;
};
};
```
OK, so this is a mouthful, but the one thing that I really need to do with customers is
ensure that I only accept prefixes from them that they're supposed to send me. I do this
with a `prefix-list` for IPv4 and IPv6, and in the importer, I simply reject any prefixes
that are not in the list. From then on, it looks very much like a peer, with identical
filtering and tagging, except now I'm using yet another ***Customer Community*** which
starts with `(8298,3,*)` and a vanilla `(8298,3500)` community. Anybody who wishes to,
can act on the presence of these communities to know that it's a downstream of IPng Networks
AS8298.
***A note on Peers and Downstreams***:
Some ISPs will not peer with their customers (as in: once you become a transit customer
they will terminate all BGP sessions at public internet exchanges), and I find that silly.
However, for me the situation becomes a little bit more complex if I were to have AS201723
both as a Downstream (as shown here) as well as a Peer (which, in fact, I do, at multiple
Amsterdam-based internet exchanges). Note how the `bgp_local_pref` is 400 on this session, and it
will always be lower on other types of sessions. The implication is that the copy of the prefix in the _RIB_
which carries `(8298,3,201723)` will be selected; the ones I learn from LSIX will
carry `(8298,1,*)`, the ones I learn from A2B (a transit provider) will carry `(8298,2,51088)`,
and neither will be selected, due to their lower localpref. As I'll demonstrate below,
I can make smart use of these communities when announcing prefixes to my own peers and upstreams,
... read on :)
## 3. Outbound: Announcing Routes
Alright, the _RIB_ is now filled with lots of prefixes that have the right localpref and
communities, for example from having been learned at an IXP, from an Upstream, or from a
Downstream. Now let's consider the following generic exporter:
```
function ebgp_export(int remote_as) {
# Remove private ASNs
bgp_path.delete([64512..65535, 4200000000..4294967295]);
# Well known BGP Large Communities
if (8298, 0, remote_as) ~ bgp_large_community then return false;
if (8298, 0, 0) ~ bgp_large_community then return false;
# Well known BGP Communities
if (0, 8298) ~ bgp_community then return false;
if (remote_as < 65536 && (0, remote_as) ~ bgp_community) then return false;
# AS path prepending
if ((8298, 103, remote_as) ~ bgp_large_community ||
(8298, 103, 0) ~ bgp_large_community) then {
bgp_path.prepend( bgp_path.first );
bgp_path.prepend( bgp_path.first );
bgp_path.prepend( bgp_path.first );
} else if ((8298, 102, remote_as) ~ bgp_large_community ||
(8298, 102, 0) ~ bgp_large_community) then {
bgp_path.prepend( bgp_path.first );
bgp_path.prepend( bgp_path.first );
} else if ((8298, 101, remote_as) ~ bgp_large_community ||
(8298, 101, 0) ~ bgp_large_community) then {
bgp_path.prepend( bgp_path.first );
}
return true;
}
```
Oh, wow! There's some really cool stuff to unpack here. As a belt-and-braces type safety,
I will remove any private AS numbers from the as-path - this prevents my own announcements
from tripping any as-path bogon filtering. But then, there are a few well-known communities
that help determine if the announcement is made or not, and there are three-and-a-half
ways of doing this:
1. `(8298,0,remote_as)`
1. `(8298,0,0)`
1. `(0,8298)`
1. `(0,remote_as)` but only if the remote_as is 16 bits.
All four of these methods will tell the router to refuse announcing the prefix on this
session. Note that downstreams are allowed to set `(8298,*,*)` and `(8298,*)` communities
(and they're the only ones who are allowed to do so). So here is where some of the cool
magic starts to happen.
Then, to drive prepending of the prefix on this session, I'll again match certain
communities: `(8298, 103, *)` will prepend the customer's AS number three times, `102`
will prepend twice, and `101` will prepend once. If the third digit is `0`, then
any session with this filter will prepend. If the third digit is the AS number, then
only sessions to this AS number will be prepended.
Using these types of communities allows downstreams (customers) incredibly fine-grained
control of propagation, at the per-IPng-session level. Not many ISPs offer this functionality!
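As a hedged sketch of what that looks like from the customer's side (the filter name and chosen AS numbers are hypothetical), a downstream could tag its announcements towards AS8298 like so:

```
# Hypothetical Bird2 export filter on a downstream's session to AS8298,
# using the action communities described above.
filter to_ipng {
  bgp_large_community.add((8298, 102, 1299));  # prepend 2x on sessions to AS1299
  bgp_large_community.add((8298, 0, 6939));    # do not announce to AS6939 at all
  accept;
}
```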
### Peers
Exporting to peers, I really need to make sure that I don't send too many prefixes. Most
of us have at some point gone through the embarrassing motions of being told by a fellow
operator "hey you're sending a full table". It is paramount to good peering hygiene
that I do not leak. So I'll define a healthy set of _defense in depth_ principles here:
```
# bgpq4 -A4b -R 24 -m 24 -l 'define AS8298_IPV4' AS8298
define AS8298_IPV4 = [ 92.119.38.0/24, 194.1.163.0/24, 194.126.235.0/24 ];
# bgpq4 -A6bR 48 -m 48 -l 'define AS8298_IPV6' AS8298
define AS8298_IPV6 = [ 2001:678:d78::/48, 2a0b:dd80::/29{29,48} ];
# bgpq4 -A4b -R 24 -m 24 -l 'define AS_IPNG_IPV4' AS-IPNG
define AS_IPNG_IPV4 = [ ... ## Removed for brevity ];
# bgpq4 -A6bR 48 -m 48 -l 'define AS_IPNG_IPV6' AS-IPNG
define AS_IPNG_IPV6 = [ .. ## Removed for brevity ];
# bgpq4 -t4b -l 'define AS_IPNG' AS-IPNG
define AS_IPNG = [112, 8298, 50869, 57777, 60557, 201723, 212323, 212855];
function aspath_first_valid() {
return (bgp_path.len = 0 || bgp_path.first ~ AS_IPNG);
}
# A list of well-known tier1 transit providers
function aspath_contains_tier1() {
return bgp_path ~ [
174, # Cogent
209, # Qwest (HE carries this on IXPs IPv6 (Jul 12 2018))
701, # UUNET
702, # UUNET
1239, # Sprint
1299, # Telia
2914, # NTT Communications
3257, # GTT Backbone
3320, # Deutsche Telekom AG (DTAG)
3356, # Level3
3549, # Level3
3561, # Savvis / CenturyLink
4134, # Chinanet
5511, # Orange opentransit
6453, # Tata Communications
6762, # Seabone / Telecom Italia
7018 ]; # AT&T
}
# The list of our own uplink (transit) providers
# Note: This list is autogenerated by our automation.
function aspath_contains_upstream() {
return bgp_path ~ [ 8283,25091,34549,51088,58299 ];
}
function ipv4_prefix_valid() {
# Our (locally sourced) prefixes
if (net ~ AS8298_IPV4) then return true;
# Customer prefixes in AS-IPNG must be tagged with customer community
if (net ~ AS_IPNG_IPV4 &&
(bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
) then return true;
return false;
}
function ipv6_prefix_valid() {
# Our (locally sourced) prefixes
if (net ~ AS8298_IPV6) then return true;
# Customer prefixes in AS-IPNG must be tagged with customer community
if (net ~ AS_IPNG_IPV6 &&
(bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
) then return true;
return false;
}
function prefix_valid() {
# as-path based filtering
if !aspath_first_valid() then return false;
if aspath_contains_tier1() then return false;
if aspath_contains_upstream() then return false;
# prefix (and BGP community) based filtering
if (net.type = NET_IP4 && !ipv4_prefix_valid()) then return false;
if (net.type = NET_IP6 && !ipv6_prefix_valid()) then return false;
return true;
}
function ebgp_export_peer(int remote_as) {
if !prefix_valid() then return false;
return ebgp_export(remote_as);
}
```
Wow, alrighty then!! All I'm doing here is checking if the call to `prefix_valid()`
returns true. That function isn't very complex. It takes a look at three as-path based
filters and then a prefix-list based filter. Let's go over them in turn:
***aspath_first_valid()*** takes a look at the first hop in the as-path. I need to
make sure that I've received this prefix from an actual downstream, and those are
collected in a RIPE `as-set` called `AS-IPNG`. So if the first BGP hop in the path is
not one of these, I'll refuse to announce the prefix.
***aspath_contains_tier1()*** is a belt-and-braces style check. How on earth would
I provide transit for any prefix for which there's already a global _Tier1_ provider
in the path? I mean, in no universe would AS174 or AS1299 need me to reach any of
their customers, or indeed, any place in the world. So this filter helps me never
announce the prefix, if it has one of these ISPs in the path.
***aspath_contains_upstream()*** similarly, if I am receiving a full table from an
upstream provider, I should not be passing those prefixes along - I would for similar
reasons never be a transit provider for A2B or IP-Max or Meerfarbig. My buddy Erik
kindly pointed out a bug of exactly this kind in my configuration, so hat-tip
to him for the intelligence.
***ipv[46]_prefix_valid()*** is the main thrust of prefix-based filtering. At this
point we've already established that the as-path is clean, but it could be that
the downstream is sending prefixes they should not (possibly leaking a full table)
so let's take a look at a good way to avoid this.
* First, we look at locally sourced routes from `AS8298`, that is the ones that I
myself originate at IPng Networks. These are always OK. The list is carefully
curated.
* Alternatively, the prefix needs to be from the as-set `AS-IPNG` (which contains
both my prefixes and all `route` and `route6` objects belonging to any AS number
that I consider a downstream),
* Finally, if the prefix is from `AS-IPNG`, I'll still add one additional check to
ensure that there is a so-called _customer community_ attached. Remember that I
discussed this specifically up in the ***Inbound - Downstream*** section.
So before I announce anything on such a session, all _four_ of as-path,
inbound prefix-list, outbound prefix-list and bgp-community are checked. This
makes it incredibly unlikely that AS8298 ever leaks prefixes -- knock on wood!
### Upstream
Interestingly, and if you think about it unsurprisingly, an upstream export configuration
is exactly identical to a peer's:
```
function ebgp_export_upstream(int remote_as) {
if !prefix_valid() then return false;
return ebgp_export(remote_as);
}
```
Alright, nothing to see here, moving on ...
### Downstream
Now the difference between a Peer and an Upstream on the one hand, and a Downstream
on the other, is that the former two will only see a very limited set of prefixes,
heavily guarded by all of that filtering I described. But a downstream typically
has the luxury of getting to learn every prefix I've learned:
```
function ipv4_acceptable_size() {
if net.len < 8 then return false;
if net.len > 24 then return false;
return true;
}
function ipv6_acceptable_size() {
if net.len < 12 then return false;
if net.len > 48 then return false;
return true;
}
function ebgp_export_downstream(int remote_as) {
if (source != RTS_BGP && source != RTS_STATIC) then return false;
if (net.type = NET_IP4 && ! ipv4_acceptable_size()) then return false;
if (net.type = NET_IP6 && ! ipv6_acceptable_size()) then return false;
return ebgp_export(remote_as);
}
```
So here I'll assert that the prefix has to be either from the `RTS_BGP` source, or
from the `RTS_STATIC` source. This latter source is what Bird uses for locally
generated routes (ie. the ones in AS8298 itself). Locally generated routes are not
known from BGP, but known instead because they are blackholed / null-routed on the
router itself. And from these routes, I further deselect those prefixes that are
too short or too long; the limits differ slightly per address family (IPv4 is
anywhere between /8 and /24, and IPv6 anywhere between /12 and /48).
Now, I will note that I've seen many operators who inject OSPF or connected or
static routes into BGP, and all of those folks will have to maintain elaborate egress
"bogon" route filters, for example for those IXP prefixes that they picked up due to
them being directly connected. If those operators simply did not propagate directly
connected routes, their lives would be much simpler ... but I digress, and it's time
for me to wrap up.
## Epilog
I hope this little dissertation proves useful for other Bird enthusiasts out there.
I myself had to fiddle a bit over the years with the idiosyncrasies (and bugs) of
Bird and Bird2. I wanted to make a few comments:
1. Thanks to the crew at [Coloclue](https://coloclue.net/) for having a really phenomenal
routing setup, with a lot of thoughtful documentation, action communities, and strict
ingress and egress filtering. It's also fully automated; my own automation is derived
from (although completely rewritten) [Kees](https://github.com/coloclue/kees).
1. I understand that the main distinction between inbound Peer and Upstream is that for Peers
many folks will want to do strict filtering. I've considered this for a long time and
ultimately decided against it, because a combination of max prefix, tier1 as-path filtering
and RPKI filtering would take care of the most egregious mistakes and otherwise, I'm actually
happy to get more prefixes via IXPs rather than less.
View File
@ -0,0 +1,474 @@
---
date: "2021-11-26T08:51:23Z"
title: 'Review: Netgate 6100'
---
* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
* Status: Draft - Review - **Approved**
A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({% post_url 2021-07-19-pcengines-apu6 %})
and introduced me to their new desktop router/firewall the Netgate 6100. It currently ships with
[pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's designed
as well to run their [TNSR](https://www.netgate.com/tnsr) software, considering the device ships
with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces
are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with
VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet
public for this device, but that's not a problem because [AS8298 runs VPP]({% post_url 2021-09-10-vpp-6 %}),
so I'll just go ahead and install it myself ...
# Executive Summary
The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis
throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps),
about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields
very similar results.
With VPP though, the router's single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This
means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction
on 10Gbit. Due to its Atom C3558 processor (which has 4 cores, 3 of which are dedicated to VPP's worker
threads, and 1 to its main thread and controlplane running in Linux), achieving 10Gbit line rate in
both directions when using 64 byte packets, is not possible.
Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26&micro;J_ of
energy per forwarded packet**, while at the same time easily handling a full BGP table with room to
spare. I find this Netgate 6100 appliance pretty impressive, and when TNSR becomes available,
performance will be similar to what I've tested here, at a price tag of USD 699,-
## Detailed findings
{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}
The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an
Intel Atom C-3558 CPU (4 cores including AES-NI and QuickAssist), 8GB of memory and either 16GB of
eMMC, or 128GB of NVME storage. The network cards are its main fort&eacute;: it comes with 2x i354
gigabit combo (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+
cage each, for a total of 8 ports and lots of connectivity.
The machine is fanless and this is made possible by its power efficient CPU: the Atom here runs at 16W
TDP only, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick,
but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are
a few nice touches that I noticed:
* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
* There's both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector,
which means you can connect to its serial port as well with a standard issue micro-USB cable. Cool!
### Battle of Operating Systems
Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering
firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial
to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So
I'll have to deface this little guy, and reinstall it with Linux. My game plan is:
1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
1. Install VPP using my own [HOWTO]({% post_url 2021-09-21-vpp-7 %}), and do all the loadtests
This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a
description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy
as I had anticipated (but is it ever easy, really?)
{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}}
Turning on the device, it presents me with some BIOS firmware from Insyde Software which
is loading some software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which
in turn is loading modules called _Lenses_, pictured right. Anyway, this ultimately presents
me with a ***Press F2 for Boot Options***. Aha! That's exactly what I'm looking for. I'm really
grateful that Netgate decided to ship a device with a BIOS that allows me to boot off of other
media, notably the USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html)
but in my case, also to install another operating system entirely.
My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the
side). But none of the USB ports want to load my UEFI `bootx64.efi` prepared USB key. So my second
attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]:
```
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso
mv mini.iso /tmp/mini-focal.iso
grub-mkimage --format=x86_64-efi \
--output=/var/tftpboot/grubnetx64.efi.signed \
--memdisk=/tmp/mini-focal.iso \
`ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'`
```
After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time
to the stone age, I watch the PXE client request an IPv4 address and then fetch the image I prepared. And, it boots, yippie!
```
Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0
Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0
Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed'
```
I took a peek while the `grubnetx64` was booting, and saw that the available output terminals
on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that
the default/active one is `console`, so I make a note that Grub wants to run on 'console' (and
specifically NOT on 'serial', as is usual, see below for a few more details on this) while the Linux
kernel will of course be running on serial, so I have to add `console=ttyS0,115200n8` to the kernel boot
string before booting.
Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get
it quite right -- pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the
`mini.iso` install:
```
mount --bind /proc /target/proc
mount --bind /dev /target/dev
mount --bind /sys /target/sys
chroot /target /bin/bash
# Install OpenSSH, otherwise the machine boots w/o access :)
apt update
apt install openssh-server
# Fix serial for GRUB and Kernel
vi /etc/default/grub
## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8"
## set GRUB_TERMINAL=console (and comment out the serial stuff)
grub-install /dev/sda
update-grub
```
Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you've still got it!
## Network Loadtest
After that small but exciting detour, let me get back to the loadtesting. The choice of Intel's
network controller on this board allows me to use Intel's DPDK with relatively high
performance, compared to regular (kernel) based routing. I loadtested the stock firmware pfSense
(21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]).
Specifically worth calling out: while Linux and FreeBSD struggled in the packets-per-second
department, the use of DPDK meant VPP had absolutely no problem filling a unidirectional 10G stream
of "regular internet traffic" (referred to as `imix`). It was also able to fill _line rate_ with
64b UDP packets, with just a little headroom to spare, but it ultimately struggled with _bidirectional_
64b UDP packets.
### Methodology
For the loadtests, I used Cisco's T-Rex [[ref](https://trex-tgn.cisco.com/)] in stateless mode,
with a custom Python controller that ramps up and down traffic from the loadtester to the _device
under test_ (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be
presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back
in on `port0`). The loadtester first sends a few seconds of warmup, this is to ensure the DUT is
passing traffic and offers the ability to inspect the traffic before the actual rampup. Then
the loadtester ramps up linearly from zero to 100% of line rate (in this case, line rate is
10Gbps in both directions), finally it holds the traffic at full line rate for a certain
duration. If at any time the loadtester fails to see the traffic it's emitting return on its
second port, it flags the DUT as saturated; and this is noted as the maximum bits/second and/or
packets/second.
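The ramp logic can be sketched in a few lines of Python. This is a toy stand-in for the real `trex-loadtest.py` controller; the `dut_forward` callback and its 5.01 Mpps cap are invented here purely for illustration:

```python
def find_saturation(dut_forward, line_rate_pps, steps=100, warmup_pct=2):
    """Toy ramp: send a warmup burst, then ramp linearly from 0 to 100% of
    line rate; flag saturation at the first step where the DUT returns
    fewer packets than were offered."""
    dut_forward(line_rate_pps * warmup_pct // 100)  # warmup, result ignored
    for step in range(1, steps + 1):
        offered = line_rate_pps * step // steps
        returned = dut_forward(offered)
        if returned < offered:
            return offered  # saturated at this offered rate
    return line_rate_pps    # held full line rate without loss

# Hypothetical DUT that caps out at 5.01 Mpps (the VPP single-core figure):
dut = lambda pps: min(pps, 5_010_000)
print(find_saturation(dut, 14_880_000))  # → 5059200 (first step above the cap)
```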
Since my last loadtesting [post]({% post_url 2021-07-19-pcengines-apu6 %}), I've learned a lot
more about packet forwarding and how to make it easier or harder on the router. Let me go into a
few more details about the various loadtests that I've done here.
#### Method 1: Single CPU Thread Saturation
Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues
if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side
Scaling_ (RSS), which means the NIC can offload its packets into multiple queues. The kernel
will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are
initialized, and inbound traffic is sent, typically using some hashing function, to individual
CPUs, allowing for a higher aggregate throughput.
Mostly, this hashing function is based on some L3 or L4 payload, for example a hash over
the source IP/port and destination IP/port. So one interesting test is to send **the same packet**
over and over again -- the hash function will then return the same value for each packet, which
means all traffic goes into exactly one of the N available queues, and therefore handled by
only one core.
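A toy hash function illustrates the effect. Real NICs typically use a Toeplitz hash over the flow tuple, but a CRC32 stand-in shows the same behaviour: one flow always maps to one queue, many flows spread across queues:

```python
import zlib

def rss_queue(src_ip, dst_ip, sport, dport, n_queues=4):
    """Toy RSS: hash the flow tuple and pick an Rx queue."""
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}".encode()
    return zlib.crc32(key) % n_queues

# The same packet over and over lands in exactly one queue...
same = {rss_queue("16.0.0.1", "48.0.0.1", 1025, 12) for _ in range(1000)}
# ...while varying the source IP spreads flows across the queues.
varied = {rss_queue(f"16.0.0.{i}", "48.0.0.1", 1025, 12) for i in range(4, 255)}
print(len(same), len(varied))
```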
One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies,
simply sends the same UDP packet from source IP/port and destination IP/port, padded with
a bunch of 'x' characters, over and over again:
```
packet = STLPktBuilder(pkt =
Ether()/
IP(src="16.0.0.1",dst="48.0.0.1")/
UDP(dport=12,sport=1025)/(10*'x')
)
```
#### Method 2: Rampup using trex-loadtest.py
TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional
arguments, does the same thing as the above method. However, this profile optionally takes a few
arguments, which are called _tunables_, notably:
* ***size*** - set the size of the packets to either a number (ie. 64, the default, or any number
up to a maximum of 9216 bytes), or the string `imix` which will send a traffic mix consisting of
60b, 590b and 1514b packets.
* ***vm*** - set the packet source/dest generation. By default (when the flag is `None`), the
same src (16.0.0.1) and dst (48.0.0.1) is set for each packet. When setting the value to
`var1`, the source IP is incremented from `16.0.0.[4-254]`. If the value is set to
`var2`, the source _and_ destination IP are incremented, the destination from `48.0.0.[4-254]`.
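As a sketch (not the actual `bench.py` code), the three `vm` modes amount to a flow generator along these lines:

```python
from itertools import islice

def bench_flows(vm=None):
    """Sketch of the bench.py 'vm' tunable: None repeats one flow, 'var1'
    increments the source IP 16.0.0.[4-254], 'var2' also increments the
    destination 48.0.0.[4-254]."""
    i = 4
    while True:
        if vm is None:
            yield ("16.0.0.1", "48.0.0.1")
        elif vm == "var1":
            yield (f"16.0.0.{i}", "48.0.0.1")
        elif vm == "var2":
            yield (f"16.0.0.{i}", f"48.0.0.{i}")
        i = i + 1 if i < 254 else 4

print(list(islice(bench_flows("var2"), 2)))
# → [('16.0.0.4', '48.0.0.4'), ('16.0.0.5', '48.0.0.5')]
```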
So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed
with this, I will perform a loadtest with four modes of interest, from easier to more challenging:
1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform,
as the traffic consists of large (1514 byte) packets, and both source and destination are
different each time, which means lots of multiplexing across receive queues, and relatively few
packets/sec.
1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This
yields what can be reasonably expected from _normal internet use_, just about 3.2Mpps at
10Gbps. This is the most representative test for normal use, but still the packet rate is quite
low due to (relatively) large packets. Any respectable router should be able to perform well at
an imix profile.
1. ***bench-var2-64b***: Still multiple flows, but very small packets, 14.88Mpps at 10Gbps,
often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as
the loadtester will fill the line with small packets (of 64 bytes, the smallest that an ethernet
packet is allowed to be). This is a good way to see if the router vendor is actually capable of
what is referred to as _line rate_ forwarding.
1. ***bench***: Now restricted to a constant src/dst IP:port tuple, and the same rate of
14.88Mpps at 10Gbps, means only one Rx queue (and thus, one CPU core) can be used. This is where
single-core performance becomes relevant. Notably, vendors who boast many CPUs, will often struggle
with a test like this, in case any given CPU core cannot individually handle a full line rate.
I'm looking at you, Tilera!
Further to this list, I can send traffic in one direction only (TRex will emit this from its
port0 and expect the traffic to be seen back at port1); or I can send it ***in both directions***.
The latter will double the packet rate and bandwidth, to approx 29.7Mpps.
***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own
transmit queues, so I decide to drive it a bit less close to the cliff and stop at 97% of line rate
(this is 28.3Mpps). It explains why lots of these loadtests top out at that number.
### Results
#### Method 1: Single CPU Thread Saturation
Given the approaches above, for the first method I can "just" saturate the line and see how many packets
emerge through the DUT on the other port, so that's only 3 tests:
Netgate 6100 | Loadtest | Throughput (pps) | L1 Throughput (bps) | % of linerate
------------- | --------------- | ---------------- | -------------------- | -------------
pfSense | 64b 1-flow | 622.98 Kpps | 418.65 Mbps | 4.19%
Linux | 64b 1-flow | 642.71 Kpps | 431.90 Mbps | 4.32%
***VPP*** | ***64b 1-flow***| ***5.01 Mpps*** | ***3.37 Gbps*** | ***33.67%***
***NOTE***: The bandwidth figures here are so called _L1 throughput_ which means bits on the wire, as opposed
to _L2 throughput_ which means bits in the ethernet frame. This is relevant particularly at 64b loadtests as
the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-frame, and 12 bytes inter-frame gap
[[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also
means that when L1 bandwidth is fully consumed at 10Gbps, the observed L2 bandwidth will be only 7.62Gbps.
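The arithmetic is easy to check in Python: at 64 byte frames, each frame occupies 84 bytes of wire time, which is where both the famous 14.88 Mpps and the 7.62 Gbps L2 figure come from:

```python
def l1_pps(line_rate_bps, frame_bytes, overhead=20):
    # 7B preamble + 1B start-of-frame + 12B inter-frame gap = 20B per frame
    return line_rate_bps / ((frame_bytes + overhead) * 8)

pps = l1_pps(10e9, 64)
l2_bps = pps * 64 * 8  # only the ethernet frame bits count towards L2
print(f"{pps/1e6:.2f} Mpps, L2 {l2_bps/1e9:.2f} Gbps")  # → 14.88 Mpps, L2 7.62 Gbps
```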
#### Interlude - VPP efficiency
In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons it's so quick is that
VPP consumes the entire core, and grabs ***a set of packets*** from the NIC rather than doing work for each
individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first
of these packets will result in the code for the current graph node to be fetched into the CPU's instruction cache,
and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency.
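A toy amortization model shows why: if the first packet of a vector pays a fixed i-cache warmup cost and subsequent packets run from warm cache, cycles/packet falls as the vector fills. The constants below are made up, chosen only to roughly mirror the measurements in this section:

```python
def cycles_per_packet(vector_size, fixed=4500, warm=350):
    """First packet pays the full warmup cost 'fixed'; the remaining
    packets in the vector cost only 'warm' cycles each."""
    return (fixed + warm * (vector_size - 1)) / vector_size

for n in (1, 18, 256):
    print(n, round(cycles_per_packet(n)))
# → 1 4500
# → 18 581
# → 256 366
```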
I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and
observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface.
I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate.
Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1
with CPU cores), can be done with `vppctl show runtime` command and some `awk` magic:
```
$ vppctl clear run && sleep 30 && vppctl show run | \
awk '$2 ~ /active|polling/ && $4 > 25000 {
print $0;
if ($1=="ethernet-input") { packets = $4};
if ($1=="dpdk-input") { dpdk_time = $6};
total_time += $6
} END {
print packets/30, "packets/sec, at",total_time,"cycles/packet,",
total_time-dpdk_time,"cycles/packet not counting DPDK"
}'
```
This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen,
and omit the columns that are not super relevant. Ready? Here we go!
```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps
Graph Node Name Clocks Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output 6.07e2 1.00
TenGigabitEthernet3/0/1-tx 8.61e2 1.00
dpdk-input 1.51e6 0.00
ethernet-input 1.22e3 1.00
ip4-input-no-checksum 6.59e2 1.00
ip4-load-balance 4.50e2 1.00
ip4-lookup 5.63e2 1.00
ip4-rewrite 5.83e2 1.00
1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK
```
I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling
the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average
vectors per call is pretty much zero, and considering that, most of the CPU time is going to sit in a nice "do
nothing". Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report on
both the total cycles spent, that is to say including DPDK, and as well the cycles spent per packet in the
_other active_ nodes. In this case, at 1kpps, VPP is spending 4953 cycles on each packet.
Now, take a look what happens when I raise the traffic to 1Mpps:
```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps
Graph Node Name Clocks Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output 3.80e1 18.57
TenGigabitEthernet3/0/1-tx 1.44e2 18.57
dpdk-input 1.15e3 .39
ethernet-input 1.39e2 18.57
ip4-input-no-checksum 8.26e1 18.57
ip4-load-balance 5.85e1 18.57
ip4-lookup 7.94e1 18.57
ip4-rewrite 7.86e1 18.57
981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK
```
Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that
the CPU efficiency went up greatly, from 4943 cycles/packet at 1kpps, to 620 cycles/packet at 1Mpps.
That's an order of magnitude improvement!
Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps:
```
tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps
Graph Node Name Clocks Vectors/Call
----------------------------------------------------------------
TenGigabitEthernet3/0/1-output 1.41e1 256.00
TenGigabitEthernet3/0/1-tx 1.23e2 256.00
dpdk-input 7.95e1 256.00
ethernet-input 6.74e1 256.00
ip4-input-no-checksum 3.95e1 256.00
ip4-load-balance 2.54e1 256.00
ip4-lookup 4.12e1 256.00
ip4-rewrite 4.78e1 256.00
5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK
```
And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which
point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4943/358)
as efficient under high load as when the CPU is unloaded. Sweet!!
Another really cool thing to do here is derive the effective clock speed of the Atom CPU. We know it runs at
2200Mhz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz,
remarkable precision. Color me impressed :-)
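A one-line sanity check of that arithmetic:

```python
pps = 5.01e6   # measured single-thread forwarding rate
cycles = 438   # CPU cycles per packet, including dpdk-input
print(f"{pps * cycles / 1e6:.0f} MHz")  # → 2194 MHz, within 0.3% of the 2200 MHz Atom clock
```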
#### Method 2: Rampup using trex-loadtest.py
For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix,
64b-multi and 64b 1-flow), then take a look at unidirectional traffic and bidirectional traffic, and perform each
of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40
loadtests!
Loadtest | pfSense | Ubuntu | VPP 1Q | VPP 2Q | VPP 3Q | Details
--------- | -------- | -------- | ------- | ------- | ------- | ----------
***Unidirectional*** |
1514b | 97% | 97% | 97% | 97% | ***97%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
imix | 61% | 75% | 96% | 95% | ***95%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
64b | 15% | 17% | 33% | 66% | ***96%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
64b 1-flow| 4.4% | 4.7% | 33% | 33% | ***33%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
***Bidirectional*** |
1514b | 192% | 193% | 193% | 193% | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
imix | 63% | 71% | 190% | 190% | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
64b | 15% | 16% | 61% | 63% | ***81%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
64b 1-flow| 8.6% | 9.0% | 61% | ***61%*** | 33% (+) | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]
A picture says a thousand words - so I invite you to take a look at the interactive graphs from the table
above. I'll cherrypick what I find the most interesting one here:
{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}
The graph above is of the unidirectional _64b_ loadtest. Some observations:
* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace), and Ubuntu 20.04.3 (running Linux 5.13, the
orange trace just above it) are equal performers. They handle full-sized (1514 byte) packets just fine,
struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1
CPU core can be used.
* Even at 64b packets, VPP scales linearly from 33% of line rate with 1Q (the green trace), 66% with 2Q (the
red trace) and 96% with 3Q (the purple trace, that makes it through to the end).
* With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2.
## Caveats
The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of
load- and systems integration testing and comparing their internal benchmarking with my findings.
Other than that, this is not a paid endorsement and views of this review are my own.
One quirk I noticed is that while running VPP with 3Q and bidirectional traffic, performance is much worse
than with 2Q or 1Q. This is not a fluke with the loadtest, as I have observed the same strange performance
with other machines (Supermicro 5018D-FN8T for example). I confirmed that each VPP worker thread is used
for each queue, so I would've expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%),
but I get 16.8% instead [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to
understand that better, but for now I'm releasing the data as-is.
## Appendix
### Generating the data
You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz).
The archive contains the `trex-loadtest.py` script as well, for curious readers!
These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz)
to plot interactive graphs (which I've done for the table above):
```
DEVICE=netgate-6100
ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json
for i in bench-var2-1514b bench-var2-64b bench imix; do
ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \
netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html
done
for i in bench-var2-1514b bench-var2-64b bench imix; do
ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \
netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html
done
```
### Notes on pfSense
I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I
simply choose the 'Give me a Shell' option, and take it from there. The router will run `pf` out of
the box, and it is pretty complex, so I'll just configure some addresses, routes and disable the
firewall altogether. That seems only fair, as the same tests with Linux and VPP also do not use
a firewall (even though obviously, both VPP and Linux support firewalls just fine).
```
ifconfig ix0 inet 100.65.1.1/24
ifconfig ix1 inet 100.65.2.1/24
route add -net 16.0.0.0/8 100.65.1.2
route add -net 48.0.0.0/8 100.65.2.2
pfctl -d
```
### Notes on Linux
When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will
thrash around re-routing softirqs between CPU threads, and at the end of the day, I'm trying to saturate
all CPUs anyway, so balancing/moving them around doesn't make any sense. Further, I configure
static ARP entries for the interfaces facing TRex:
```
sudo systemctl disable irqbalance
sudo systemctl stop irqbalance
sudo systemctl mask irqbalance
sudo ip addr add 100.65.1.1/24 dev enp3s0f0
sudo ip addr add 100.65.2.1/24 dev enp3s0f1
sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0
sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1
sudo ip ro add 16.0.0.0/8 via 100.65.1.2
sudo ip ro add 48.0.0.0/8 via 100.65.2.2
```
On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest:
```
root@netgate:/home/pim# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 3 0 0 1
TIMER: 203788 247280 259544 401401
NET_TX: 8956 8373 7836 6154
NET_RX: 22003822 19316480 22526729 19430299
BLOCK: 2545 3153 2430 1463
IRQ_POLL: 0 0 0 0
TASKLET: 5084 60 1830 23
SCHED: 137647 117482 56371 49112
HRTIMER: 0 0 0 0
RCU: 11550 9023 8975 8075
```
View File
@ -0,0 +1,529 @@
---
date: "2021-12-23T17:58:14Z"
title: VPP Linux CP - Virtual Machine Playground
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing, is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates will be well in reach!
Before we head off into the end of year holidays, I thought I'd make good on a promise I made a while
ago: to explain how to turn a Debian (Buster or Bullseye) or Ubuntu (Focal Fossa LTS)
virtual machine running in Qemu/KVM into a working setup with both [Free Range Routing](https://frrouting.org/)
and [Bird](https://bird.network.cz/) installed side by side.
**NOTE**: If you're just interested in the resulting image, here's the most pertinent information:
> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]***
> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309`
> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e***
> * ***CPU*** Make sure the (virtualized) CPU supports AVX
> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX
> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled
> * ***Root Password***: `IPng loves VPP`
Of course, I do recommend that you change the passwords for the `ipng` and `root` user as soon as you
boot the VM. I am offering the KVM images as-is and without any support. [Contact](/s/contact/) us if
you'd like to discuss support on commission.
## Reminder - Linux CP
[Vector Packet Processing](https://fd.io/) by itself offers _only_ a dataplane implementation, that is
to say it cannot run controlplane software like OSPF, BGP, LDP etc out of the box. However, VPP allows
_plugins_ to offer additional functionality. Rather than adding the routing protocols as VPP plugins,
I'd much rather leverage high-quality and well-supported community efforts like [FRR](https://frrouting.org/)
or [Bird](https://bird.network.cz/).
I wrote a series of in-depth [articles](/s/articles/) explaining in detail the design and implementation,
but for the purposes of this article, I will keep it brief. The Linux Control Plane (LCP) is a set of two
plugins:
1. The ***Interface plugin*** is responsible for taking VPP interfaces (like ethernet, tunnel, bond)
and exposing them in Linux as a TAP device. When configuration such as link MTU, state, MAC
address or IP address are applied in VPP, the plugin will copy this forward into the host
interface representation.
1. The ***Netlink plugin*** is responsible for taking events in Linux (like a user setting an IP address
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.
I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({% post_url 2021-02-27-network %})
since Sep'21 and no crashes related to LinuxCP have been observed.
To help tinkerers, this article describes a KVM disk image in _qcow2_ format, which will boot a vanilla
Debian install and further comes pre-loaded with a fully functioning VPP, LinuxCP and both FRR and Bird
controlplane environment. I'll go into detail on precisely how you can build your own. Of course, you're
welcome to just take the results of this work and download the `qcow2` image above.
### Building the Debian KVM image
In this section I'll try to be precise about the steps I took to create the KVM _qcow2_ image, in case you're
interested in reproducing it for yourself. Overall, I find that reading about how folks build images teaches
me a lot about the underlying configurations, and I'm also keen on remembering how to do it myself,
so this article doubles as reference documentation for IPng Networks in case we want to build
images in the future.
#### Step 1. Install Debian
For this, I'll use `virt-install` completely on the prompt of my workstation, a Linux machine which
is running Ubuntu Hirsute (21.04). Assuming KVM is installed [ref](https://help.ubuntu.com/community/KVM/Installation)
and already running, let's build a simple Debian Bullseye _qcow2_ bootdisk:
```
pim@chumbucket:~$ sudo apt-get install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils
pim@chumbucket:~$ sudo apt-get install virtinst
pim@chumbucket:~$ sudo adduser `id -un` libvirt
pim@chumbucket:~$ sudo adduser `id -un` kvm
pim@chumbucket:~$ qemu-img create -f qcow2 vpp-proto.qcow2 8G
pim@chumbucket:~$ virt-install --virt-type kvm --name vpp-proto \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian10 \
--disk /home/pim/vpp-proto.qcow2,bus=virtio \
--memory 4096 \
--graphics none \
--network=bridge:mgmt \
--console pty,target_type=serial \
--extra-args "console=ttyS0" \
--check all=off
```
_Note_: You may want to use a different network bridge, commonly `bridge:virbr0`. In my case, the
network which runs DHCP is on a bridge called `mgmt`. And, just for pedantry, it's good to make
yourself a member of groups `kvm` and `libvirt` so that you can run most `virsh` commands as an
unprivileged user.
During the Debian Bullseye install, I try to leave everything as vanilla as possible, but I do
enter the following specifics:
* ***Root Password*** the string `IPng loves VPP`
* ***User*** login `ipng` with password `ipng loves vpp`
* ***Disk*** is entirely in one partition / (all 8GB of it), no swap
* ***Software selection*** remove everything but `SSH server` and `standard system utilities`
When the machine is done installing, it'll reboot and I'll log in as root to install a few packages,
most notably `sudo` which will allow the user `ipng` to act as root. The other seemingly weird
packages are to help the VPP install along later.
```
root@vpp-proto:~# apt install rsync net-tools traceroute snmpd snmp iptables sudo gnupg2 \
curl libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \
libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0
root@vpp-proto:~# adduser ipng sudo
root@vpp-proto:~# poweroff
```
Finally, after I stop the VM, I'll edit its XML config to give it a few VirtIO NICs to play with,
nicely grouped on the same virtual PCI bus/slot. I look for the existing `<interface>` block that
`virt-install` added for me, and add four new ones under that, all added to a newly created bridge
called `empty`, for now:
```
pim@chumbucket:~$ sudo brctl addbr empty
pim@chumbucket:~$ virsh edit vpp-proto
<interface type='bridge'>
<mac address='52:54:00:13:10:00'/>
<source bridge='empty'/>
<target dev='vpp-e0'/>
<model type='virtio'/>
<mtu size='9000'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x0' multifunction='on'/>
</interface>
<interface type='bridge'>
<mac address='52:54:00:13:10:01'/>
<source bridge='empty'/>
<target dev='vpp-e1'/>
<model type='virtio'/>
<mtu size='9000'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x1'/>
</interface>
<interface type='bridge'>
<mac address='52:54:00:13:10:02'/>
<source bridge='empty'/>
<target dev='vpp-e2'/>
<model type='virtio'/>
<mtu size='9000'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x2'/>
</interface>
<interface type='bridge'>
<mac address='52:54:00:13:10:03'/>
<source bridge='empty'/>
<target dev='vpp-e3'/>
<model type='virtio'/>
<mtu size='9000'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x3'/>
</interface>
pim@chumbucket:~$ virsh start --console vpp-proto
```
And with that, I have a lovely virtual machine to play with, serial and all, beautiful!
![VPP Proto](/assets/vpp-proto/vpp-proto.png)
#### Step 2. Compile VPP + Linux CP
Compiling DPDK and VPP can both take a while, and to avoid cluttering the virtual machine, I'll do
this step on my buildfarm and copy the resulting Debian packages back onto the VM.
This step simply follows [VPP's doc](https://fdio-vpp.readthedocs.io/en/latest/gettingstarted/developers/building.html)
but to recap the individual steps here, I will:
* use Git to check out both VPP and my plugin
* ensure all Debian dependencies are installed
* build DPDK libraries as a Debian package
* build VPP and its plugins (including LinuxCP)
* finally, build a set of Debian packages out of the VPP, Plugins, DPDK, etc.
The resulting packages will work both on Debian (Buster and Bullseye) as well as Ubuntu (Focal, 20.04).
So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
```
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp
pim@rhino:~/src/vpp$ make install-deps
pim@rhino:~/src/vpp$ make install-ext-deps
pim@rhino:~/src/vpp$ make build-release
pim@rhino:~/src/vpp$ make pkg-deb
```
This will yield the following Debian packages and, would you believe it, at exactly leet-o'clock :-)
```
pim@rhino:~/src/vpp$ ls -hSl build-root/*.deb
-rw-r--r-- 1 pim pim 71M Dec 23 13:37 build-root/vpp-dbg_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 4.7M Dec 23 13:37 build-root/vpp_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 4.2M Dec 23 13:37 build-root/vpp-plugin-core_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 3.7M Dec 23 13:37 build-root/vpp-plugin-dpdk_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 1.3M Dec 23 13:37 build-root/vpp-dev_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 308K Dec 23 13:37 build-root/vpp-plugin-devtools_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 173K Dec 23 13:37 build-root/libvppinfra_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 138K Dec 23 13:37 build-root/libvppinfra-dev_22.02-rc0~421-ge6387b2b9_amd64.deb
-rw-r--r-- 1 pim pim 27K Dec 23 13:37 build-root/python3-vpp-api_22.02-rc0~421-ge6387b2b9_amd64.deb
```
I've copied these packages to our `vpp-proto` image in `~ipng/packages/`, where I'll simply install
them using `dpkg`:
```
ipng@vpp-proto:~$ sudo mkdir -p /var/log/vpp
ipng@vpp-proto:~$ sudo dpkg -i ~/packages/*.deb
ipng@vpp-proto:~$ sudo adduser `id -un` vpp
```
I'll configure 2GB of hugepages and 64MB of netlink buffer size - see my [VPP #7]({% post_url 2021-09-21-vpp-7 %})
post for more details and lots of background information:
```
ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/sysctl.d/80-vpp.conf
vm.nr_hugepages=1024
vm.max_map_count=3096
vm.hugetlb_shm_group=0
kernel.shmmax=2147483648
EOF
ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf
net.core.rmem_default=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.wmem_max=67108864
EOF
ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf
ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf
```
Next, I'll create a network namespace for VPP and associated controlplane software to run in; this is because
VPP will want to create its TUN/TAP devices separately from the _default_ namespace:
```
ipng@vpp-proto:~$ cat << EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service
[Unit]
Description=Dataplane network namespace
After=systemd-sysctl.service network-pre.target
Before=network.target network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
# PrivateNetwork will create network namespace which can be
# used in JoinsNamespaceOf=.
PrivateNetwork=yes
# To set `ip netns` name for this namespace, we create a second namespace
# with required name, unmount it, and then bind our PrivateNetwork
# namespace to it. After this we can use our PrivateNetwork as a named
# namespace in `ip netns` commands.
ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace"
ExecStart=-/usr/sbin/ip netns delete dataplane
ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane
ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf
ExecStart=-/usr/sbin/ip netns add dataplane
ExecStart=-/usr/bin/umount /var/run/netns/dataplane
ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane
# Apply default sysctl for dataplane namespace
ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl
ExecStop=-/usr/sbin/ip netns delete dataplane
[Install]
WantedBy=multi-user.target
WantedBy=network-online.target
EOF
ipng@vpp-proto:~$ sudo systemctl enable netns-dataplane
ipng@vpp-proto:~$ sudo systemctl start netns-dataplane
```
Finally, I'll add a useful startup configuration for VPP (note the comment on `poll-sleep-usec`,
which slows down the DPDK poller, making it a little bit milder on the CPU):
```
ipng@vpp-proto:~$ cd /etc/vpp
ipng@vpp-proto:/etc/vpp$ sudo cp startup.conf startup.conf.orig
ipng@vpp-proto:/etc/vpp$ cat << EOF | sudo tee startup.conf
unix {
nodaemon
log /var/log/vpp/vpp.log
cli-listen /run/vpp/cli.sock
gid vpp
## This makes VPP sleep 1ms between each DPDK poll, greatly
## reducing CPU usage, at the expense of latency/throughput.
poll-sleep-usec 1000
## Execute all CLI commands from this file upon startup
exec /etc/vpp/bootstrap.vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
memory {
main-heap-size 512M
main-heap-page-size default-hugepage
}
buffers {
buffers-per-numa 128000
default data-size 2048
page-size default-hugepage
}
statseg {
size 1G
page-size default-hugepage
per-node-counters off
}
plugins {
plugin lcpng_nl_plugin.so { enable }
plugin lcpng_if_plugin.so { enable }
}
logging {
default-log-level info
default-syslog-log-level notice
class linux-cp/if { rate-limit 10000 level debug syslog-level debug }
class linux-cp/nl { rate-limit 10000 level debug syslog-level debug }
}
lcpng {
default netns dataplane
lcp-sync
lcp-auto-subint
}
EOF
ipng@vpp-proto:/etc/vpp$ cat << EOF | sudo tee bootstrap.vpp
comment { Create a loopback interface }
create loopback interface instance 0
lcp create loop0 host-if loop0
set interface state loop0 up
set interface ip address loop0 2001:db8::1/64
set interface ip address loop0 192.0.2.1/24
comment { Create Linux Control Plane interfaces }
lcp create GigabitEthernet10/0/0 host-if e0
lcp create GigabitEthernet10/0/1 host-if e1
lcp create GigabitEthernet10/0/2 host-if e2
lcp create GigabitEthernet10/0/3 host-if e3
EOF
ipng@vpp-proto:/etc/vpp$ sudo systemctl restart vpp
```
After all of this, the following screenshot is a reasonable confirmation of success.
![VPP Interfaces + LCP](/assets/vpp-proto/vppctl-ip-link.png)
#### Step 3. Install / Configure FRR
Debian Bullseye ships with FRR 7.5.1, which will be fine. But for completeness, I'll point out that
FRR maintains their own Debian package [repo](https://deb.frrouting.org/) as well, and they're currently
releasing FRR 8.1 as stable, so I opt to install that one instead:
```
ipng@vpp-proto:~$ curl -s https://deb.frrouting.org/frr/keys.asc | sudo apt-key add -
ipng@vpp-proto:~$ FRRVER="frr-stable"
ipng@vpp-proto:~$ echo deb https://deb.frrouting.org/frr $(lsb_release -s -c) $FRRVER | \
sudo tee -a /etc/apt/sources.list.d/frr.list
ipng@vpp-proto:~$ sudo apt update && sudo apt install frr frr-pythontools
ipng@vpp-proto:~$ sudo adduser `id -un` frr
ipng@vpp-proto:~$ sudo adduser `id -un` frrvty
```
After installing, FRR will start up in the _default_ network namespace, but I'm going to be using
VPP in a custom namespace called `dataplane`. FRR versions after 7.5 can work with multiple namespaces
[ref](http://docs.frrouting.org/en/stable-8.1/setup.html?highlight=pathspace%20netns#network-namespaces),
which boils down to adding the following `daemons` file:
```
ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/frr/daemons
bgpd=yes
ospfd=yes
ospf6d=yes
bfdd=yes
vtysh_enable=yes
watchfrr_options="--netns=dataplane"
zebra_options=" -A 127.0.0.1 -s 67108864"
bgpd_options=" -A 127.0.0.1"
ospfd_options=" -A 127.0.0.1"
ospf6d_options=" -A ::1"
staticd_options="-A 127.0.0.1"
bfdd_options=" -A 127.0.0.1"
EOF
ipng@vpp-proto:~$ sudo systemctl restart frr
```
After restarting FRR with this _namespace_ aware configuration, I can check to ensure it found
the `loop0` and `e0-3` interfaces VPP defined above. Let's take a look, while I set link `e0`
up and give it an IPv4 address. I'll do this in the `dataplane` namespace, and expect that FRR
picks this up as it's monitoring the netlink messages in that namespace as well:
![VPP VtySH](/assets/vpp-proto/vpp-frr.png)
#### Step 4. Install / Configure Bird2
Installing Bird2 is straightforward, although as with FRR above, after installing it'll want to
run in the _default_ namespace, which we ought to change. While we're at it, let's give it a bit of a
default configuration to get started:
```
ipng@vpp-proto:~$ sudo apt-get install bird2
ipng@vpp-proto:~$ sudo systemctl stop bird
ipng@vpp-proto:~$ sudo systemctl disable bird
ipng@vpp-proto:~$ sudo systemctl mask bird
ipng@vpp-proto:~$ sudo adduser `id -un` bird
```
Then, I create a systemd unit for Bird running in the dataplane:
```
ipng@vpp-proto:~$ sed -e 's,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,' < \
/usr/lib/systemd/system/bird.service | sudo tee /usr/lib/systemd/system/bird-dataplane.service
ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane
```
And, finally, I create some reasonable default config and start bird in the dataplane namespace:
```
ipng@vpp-proto:~$ cd /etc/bird
ipng@vpp-proto:/etc/bird$ sudo cp bird.conf bird.conf.orig
ipng@vpp-proto:/etc/bird$ cat << EOF | sudo tee bird.conf
router id 192.0.2.1;
protocol device { scan time 30; }
protocol direct { ipv4; ipv6; check link yes; }
protocol kernel kernel4 {
ipv4 { import none; export where source != RTS_DEVICE; };
learn off;
scan time 300;
}
protocol kernel kernel6 {
ipv6 { import none; export where source != RTS_DEVICE; };
learn off;
scan time 300;
}
EOF
ipng@vpp-proto:/usr/lib/systemd/system$ sudo systemctl start bird-dataplane
```
And the results are quite similar to FRR: because the VPP plugins work via Netlink,
basically any program that operates in the _dataplane_ namespace can interact with the
kernel TAP interfaces, create/remove links, set state and MTU, and add/remove IP addresses
and routes:
![VPP Bird2](/assets/vpp-proto/vpp-bird.png)
### Choosing FRR or Bird
At IPng Networks, we have historically used, and continue to use, Bird as our routing system
of choice. But I totally realize the potential of FRR; in fact, its implementation of LDP
is what may drive me onto that platform after all, as I'd love to add MPLS support to the
LinuxCP plugin at some point :-)
By default the KVM image comes with **both FRR and Bird enabled**. This is OK because there
is no configuration on them yet, and they won't be in each others' way. It makes sense for
users of the image to make a conscious choice which of the two they'd like to use, and simply
disable and mask the other one:
#### If FRR is your preference:
```
ipng@vpp-proto:~$ sudo systemctl stop bird-dataplane
ipng@vpp-proto:~$ sudo systemctl disable bird-dataplane
ipng@vpp-proto:~$ sudo systemctl mask bird-dataplane
ipng@vpp-proto:~$ sudo systemctl unmask frr
ipng@vpp-proto:~$ sudo systemctl enable frr
ipng@vpp-proto:~$ sudo systemctl start frr
```
#### If Bird is your preference:
```
ipng@vpp-proto:~$ sudo systemctl stop frr
ipng@vpp-proto:~$ sudo systemctl disable frr
ipng@vpp-proto:~$ sudo systemctl mask frr
ipng@vpp-proto:~$ sudo systemctl unmask bird-dataplane
ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane
ipng@vpp-proto:~$ sudo systemctl start bird-dataplane
```
And with that, I hope to have given you a good overview of what comes into play when
installing a Debian machine with VPP, my LinuxCP plugin, and FRR or Bird: Happy hacking!
### One last thing ..
After I created the KVM image, I made a qcow2 snapshot of it in pristine state. This means
you can mess around with the VM, and easily revert to that pristine state without having
to download the image again. You can also add some customization (as I've done for our own
VPP Lab at IPng Networks) and set another snapshot and roll forwards and backwards between
them. The syntax is:
```
## Create a named snapshot
pim@chumbucket:~$ qemu-img snapshot -c pristine vpp-proto.qcow2
## List snapshots in the image
pim@chumbucket:~$ qemu-img snapshot -l vpp-proto.qcow2
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK ICOUNT
1 pristine 0 B 2021-12-23 17:52:36 00:00:00.000 0
## Revert to the named snapshot
pim@chumbucket:~$ qemu-img snapshot -a pristine vpp-proto.qcow2
## Delete the named snapshot
pim@chumbucket:~$ qemu-img snapshot -d pristine vpp-proto.qcow2
```
---
date: "2022-01-12T18:35:14Z"
title: Case Study - Virtual Leased Line (VLL) in VPP
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes
can be shared between VPP and the Linux kernel in a clever way, so running software like
[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates are easily in reach!
If you've read my previous articles (thank you!), you will have noticed that I have done a lot of
work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there are many other cool
things about VPP that make it a very competent advanced services router. One that has always been
super interesting to me is being able to offer L2 connectivity over a wide-area network: for example, a
virtual leased line from our colo in Zurich to Amsterdam NIKHEF. This article explores this space.
***NOTE***: If you're only interested in results, scroll all the way down to the markdown table and
graph for performance stats.
## Introduction
ISPs can offer ethernet services, often called _Virtual Leased Lines_ (VLLs), _Layer2 VPN_ (L2VPN)
or _Ethernet Backhaul_. They mean the same thing: imagine a switchport in location A that appears
to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so
IPv4 and IPv6) network in between. The "simple" old-school setup would be to have switches which
define VLANs and are all interconnected. But we collectively learned that it's a bad idea for several
reasons:
* Large broadcast domains tend to encounter L2 forwarding loops sooner rather than later
* Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding,
which can be expensive if that port is connected to a dark fiber into another datacenter far
away.
* Large VLAN setups that are intended to interconnect with other operators run into overlapping
VLAN tags, which means switches have to do tag rewriting and filtering and such.
* Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts
of smart TE extensions, ECMP, and so on.
The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some
tunneling mechanism, for example in MPLS or in some IP tunneling protocol. Fundamentally, these are
the same, except for the chosen protocol and overhead/cost of forwarding. MPLS is a very thin layer
under the packet, but other IP based tunneling mechanisms exist, commonly used are GRE, VXLAN and GENEVE
but many others exist.
They all work roughly the same:
* An IP packet has a _maximum transmission unit_ (MTU) of 1500 bytes, while the ethernet header is
typically an additional 14 bytes: a 6 byte source MAC, 6 byte destination MAC, and 2 byte ethernet
type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, and 0x86dd for IPv6, and many others
[[ref](https://en.wikipedia.org/wiki/EtherType)].
* If VLANs are used, an additional 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1Q)]
making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100.
* If QinQ or QinAD are used, yet again 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1ad)],
making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100,
depending on the implementation.
* We can take such an ethernet frame, and make it the _payload_ of another IP packet, encapsulating the original
ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a remote
site.
* Upon receipt of such a packet, by looking at the headers the remote router can determine that this
packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto a
given interface.
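As a quick sanity check of the frame sizes above, the arithmetic (just the header sizes from the list) works out as follows:

```python
# Maximum ethernet frame sizes, per the list above.
IP_MTU  = 1500   # maximum IP payload (MTU)
ETH_HDR = 14     # 6B dst MAC + 6B src MAC + 2B ethertype
VLAN    = 4      # one 802.1q tag; QinQ/QinAD adds a second one

untagged = IP_MTU + ETH_HDR       # plain ethernet frame
dot1q    = untagged + VLAN        # single-tagged frame
qinq     = dot1q + VLAN           # double-tagged frame

print(untagged, dot1q, qinq)      # → 1514 1518 1522
```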
### IP Tunneling Protocols
First let's get some theory out of the way -- I'll discuss three common IP tunneling protocols here, and then
move on to demonstrate how they are configured in VPP and perhaps more importantly, how they perform in VPP.
Each tunneling protocol has its own advantages and disadvantages, but I'll stick to the basics first:
#### GRE: Generic Routing Encapsulation
_Generic Routing Encapsulation_ (GRE, described in [RFC2784](https://datatracker.ietf.org/doc/html/rfc2784)) is a
very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting
of a header with 4 bits of flags, 8 reserved bits, 3 bits for the version (normally set to all-zeros), and
16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It's a very
small header of only 4 bytes, plus an optional key (4 bytes) and sequence number (also 4 bytes), which means
that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the _underlay_
must have an end to end MTU of at least 1522 + 20(IPv4)+12(GRE) = ***1554 bytes for IPv4*** and ***1574
bytes for IPv6***.
#### VXLAN: Virtual Extensible LAN
_Virtual Extensible LAN_ (VXLAN, described in [RFC7348](https://datatracker.ietf.org/doc/html/rfc7348))
is a UDP datagram which has a header consisting of 8 bits worth
of flags, 24 bits reserved for future expansion, 24 bits of _Virtual Network Identifier_ (VNI) and
an additional 8 bits of reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN
encapsulation adds 20(IPv4)+8(UDP)+8(VXLAN) = 36 bytes; with the 40 byte IPv6 header it instead
adds 56 bytes. This means that to be able to transport any ethernet frame, the _underlay_
network must have an end to end MTU of at least 1522+36 = ***1558 bytes for IPv4*** and ***1578 bytes
for IPv6***.
#### GENEVE: Generic Network Virtualization Encapsulation
_GEneric NEtwork Virtualization Encapsulation_ (GENEVE, described in [RFC8926](https://datatracker.ietf.org/doc/html/rfc8926))
is somewhat similar to VXLAN, although it was an attempt to stop the wild growth of tunneling protocols;
I'm sure there is an [XKCD](https://xkcd.com/927/) out there specifically for this approach. The packet is
also a UDP datagram with destination port 6081, followed by an 8 byte GENEVE specific header, containing
2 bits of version, 8 bits for flags, a 16 bit inner ethertype, a 24 bit _Virtual Network Identifier_ (VNI),
and 8 bits of reserved space. With GENEVE, several options are available and will be tacked onto the GENEVE
header, but they are typically not used. If they are though, the options can add an additional 16 bytes
which means that to be able to transport any ethernet frame, the _underlay_ network must have an end to
end MTU of at least 1522+52 = ***1574 bytes for IPv4*** and ***1594 bytes for IPv6***.
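Putting the three encapsulations side by side, the minimum underlay MTU figures quoted above can be reproduced with a little arithmetic (a sketch; GRE counted with its optional key and sequence number, GENEVE with 16 bytes of options):

```python
FRAME = 1522              # largest customer frame (QinQ, see above)
IPV4_HDR, IPV6_HDR, UDP_HDR = 20, 40, 8

# Tunnel overhead on top of the outer IP header, in bytes.
overhead = {
    "GRE":    4 + 4 + 4,          # base header + optional key + sequence number
    "VXLAN":  UDP_HDR + 8,        # UDP + VXLAN header
    "GENEVE": UDP_HDR + 8 + 16,   # UDP + GENEVE header + options
}

for proto, extra in overhead.items():
    print(proto, FRAME + IPV4_HDR + extra, FRAME + IPV6_HDR + extra)
# → GRE 1554 1574
# → VXLAN 1558 1578
# → GENEVE 1574 1594
```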
### Hardware setup
First let's take a look at the physical setup. I'm using three servers and a switch in the IPng Networks lab:
{{< image width="400px" float="right" src="/assets/vpp/l2-xconnect-lab.png" alt="Loadtest Setup" >}}
* `hvn0`: Dell R720xd, load generator
* Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes
* 64GB DDR3 at 1333MT/s
* Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s)
* `Hippo` and `Rhino`: VPP routers
* ASRock B550 Taichi
* Ryzen 5950X 32 CPUs, 2 threads per core, 1 numa node
* 64GB DDR4 at 2133 MT/s
* Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s)
* `fsw0`: FS.com switch S5860-48SC, 8x 100G, 48x 10G
* VLAN 4 (blue) connects Rhino's `Hu12/0/1` to Hippo's `Hu12/0/1`
* VLAN 5 (red) connects hvn0's `enp5s0f0` to Rhino's `Hu12/0/0`
* VLAN 6 (green) connects hvn0's `enp5s0f1` to Hippo's `Hu12/0/0`
* All switchports have jumbo frames enabled and are set to 9216 bytes.
Further, Hippo and Rhino are running VPP at head `vpp v22.02-rc0~490-gde3648db0`, and hvn0 is running
T-Rex v2.93 in L2 mode, with MAC address `00:00:00:01:01:00` on the first port, and MAC address
`00:00:00:02:01:00` on the second port. This machine can saturate 10G in both directions with small
packets even when using only one flow, as can be seen when the ports are simply looped back onto one
another, for example by physically cross-connecting them with an SFP+ or DAC cable; or in my case by putting
`fsw0` port `Te0/1` and `Te0/2` in the same VLAN together:
{{< image width="800px" src="/assets/vpp/l2-xconnect-trex.png" alt="TRex on hvn0" >}}
Now that I have shared all the context and hardware, I'm ready to actually dive in to what I wanted to
talk about: how does all this _virtual leased line_ business look like, for VPP. Ready? Here we go!
### Direct L2 CrossConnect
The simplest thing I can show in VPP, is to configure a layer2 cross-connect (_l2 xconnect_) between
two ports. In this case, VPP doesn't even need to have an IP address, all I do is bring up the ports,
set their MTU to be able to carry the 1522 bytes frames (ethernet at 1514, dot1q at 1518, and QinQ
at 1522 bytes). The configuration is identical on both Rhino and Hippo:
```
set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/1 up
set interface mtu packet 1522 HundredGigabitEthernet12/0/0
set interface mtu packet 1522 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0
```
I'd say the only thing to keep in mind here is that each cross-connect command only
links in one direction (receive on A, forward to B), which is why I have to type them twice (receive on B,
forward to A). Of course, this must be really cheap for VPP -- because all it has to do now is receive
from DPDK and immediately schedule for transmit on the other port. Looking at `show runtime` I can
see how much CPU time is spent in each of VPP's nodes:
```
Time 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85
vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/1-o 650727833 18472218801 7.49e0 28.39
HundredGigabitEthernet12/0/1-t 650727833 18472218801 4.12e1 28.39
ethernet-input 650727833 18472218801 5.55e1 28.39
l2-input 650727833 18472218801 1.52e1 28.39
l2-output 650727833 18472218801 1.32e1 28.39
```
In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it
into `l2-input`, and immediately send it straight through `l2-output` back out, which does not cost
much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G
at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU
has room to spare in this mode, in other words even _one CPU thread_ can handle this workload at
line rate, impressive!
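The 133 cycles quoted above can be cross-checked against the `show runtime` output: the per-packet cost is simply the sum of the Clocks column across the nodes in the path (a quick sketch):

```python
# Clocks per packet for each node, taken from the `show runtime` output above.
clocks = {
    "HundredGigabitEthernet12/0/1-o": 7.49e0,
    "HundredGigabitEthernet12/0/1-t": 4.12e1,
    "ethernet-input":                 5.55e1,
    "l2-input":                       1.52e1,
    "l2-output":                      1.32e1,
}
total = sum(clocks.values())
print(round(total))                         # → 133 clocks/packet

# At 14.88 Mpps, that is roughly a 2 GHz cycle budget on a single thread:
print(round(14.88e6 * total / 1e9, 2))      # → 1.97 (GHz)
```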
Although cool, doing an L2 crossconnect like this isn't super useful. Usually, the customer leased line
has to be transported to another location, and for that we'll need some form of encapsulation ...
### Crossconnect over IPv6 VXLAN
Let's start with VXLAN. The concept is pretty straight forward in VPP. Based on the configuration
I put in Rhino and Hippo above, I first will have to bring `Hu12/0/1` out of L2 mode, give both interfaces an
IPv6 address, create a tunnel with a given _VNI_, and then crossconnect the customer side `Hu12/0/0`
into the `vxlan_tunnel0` and vice-versa. Piece of cake:
```
## On Rhino
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
## On Hippo
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```
Of course, now we're actually beginning to make VPP do some work, and the exciting thing is: if there
were an (opaque) ISP network between Rhino and Hippo, this would work just fine, considering the
encapsulation is 'just' IPv6 UDP. Under the covers, for each received frame, VPP has to encapsulate it
into VXLAN, and route the resulting L3 packet by doing an IPv6 routing table lookup:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74
vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 333777 85445944 2.74e0 255.99
HundredGigabitEthernet12/0/0-t 333777 85445944 5.28e1 255.99
ethernet-input 333777 85445944 4.25e1 255.99
ip6-input 333777 85445944 1.25e1 255.99
ip6-lookup 333777 85445944 2.41e1 255.99
ip6-receive 333777 85445944 1.71e2 255.99
ip6-udp-lookup 333777 85445944 1.55e1 255.99
l2-input 333777 85445944 8.94e0 255.99
l2-output 333777 85445944 4.44e0 255.99
vxlan6-input 333777 85445944 2.12e1 255.99
```
I can definitely see a lot more action here. In this mode, VPP is handling 8.54Mpps on this CPU thread
before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in
node `ip6-receive`.
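Here too, the 356 cycles match the sum of the Clocks column, and the numbers confirm that `ip6-receive` takes nearly half of the total (a quick sketch):

```python
# Clocks per packet for each node, from the `show runtime` output above.
clocks = {
    "HundredGigabitEthernet12/0/0-o": 2.74e0,
    "HundredGigabitEthernet12/0/0-t": 5.28e1,
    "ethernet-input":                 4.25e1,
    "ip6-input":                      1.25e1,
    "ip6-lookup":                     2.41e1,
    "ip6-receive":                    1.71e2,
    "ip6-udp-lookup":                 1.55e1,
    "l2-input":                       8.94e0,
    "l2-output":                      4.44e0,
    "vxlan6-input":                   2.12e1,
}
total = sum(clocks.values())
print(round(total))                              # → 356 clocks/packet
print(round(clocks["ip6-receive"] / total, 2))   # → 0.48, almost half
```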
### Crossconnect over IPv4 VXLAN
Seeing `ip6-receive` being such a big part of the cost (almost half!), I wonder what it might look like if
I change the tunnel to use IPv4. So I'll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made
before (the IPv6 one), and create a new one with IPv4:
```
set interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```
And after letting this run for a few seconds, I can take a look and see how the `ip4-*` version of
the VPP code performs:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71
vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 552890 141539600 2.76e0 255.99
HundredGigabitEthernet12/0/0-t 552890 141539600 5.30e1 255.99
ethernet-input 552890 141539600 4.13e1 255.99
ip4-input-no-checksum 552890 141539600 1.18e1 255.99
ip4-lookup 552890 141539600 1.68e1 255.99
ip4-receive 552890 141539600 2.74e1 255.99
ip4-udp-lookup 552890 141539600 1.79e1 255.99
l2-input 552890 141539600 8.68e0 255.99
l2-output 552890 141539600 4.41e0 255.99
vxlan4-input 552890 141539600 1.76e1 255.99
```
Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU
cycles per packet, considerably less time spent than in IPv6, but keep in mind that VPP has an ~empty
routing table in all of these tests.
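As a sanity check on these numbers: throughput and per-packet cost at saturation are two views of the same thing, so multiplying them back together should roughly recover the worker core's clock rate. A quick sketch using the two VXLAN runs above:

```python
# (pps, cycles/packet) at saturation for the two VXLAN runs above.
runs = {
    "VXLAN-v6": (8.54e6, 356),
    "VXLAN-v4": (14.15e6, 202),
}

# pps * cycles/packet approximates the worker core's clock frequency.
for name, (pps, cycles) in runs.items():
    ghz = pps * cycles / 1e9
    print(f"{name}: {pps/1e6:.2f} Mpps x {cycles} cyc = {ghz:.2f} GHz core")
```

Both runs land around 3GHz, which is consistent with a single fully-loaded worker core.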
### Crossconnect over IPv6 GENEVE
Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE. The configuration
is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead:
```
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```
All the while, the TRex on the customer machine `hvn0` is sending 14.88Mpps in both directions, and after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the customer `Hu12/0/0` interfaces, and starts to carry traffic:
```
Thread 8 vpp_wk_7 (lcore 8)
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03
vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 324981 83194664 2.74e0 255.99
HundredGigabitEthernet12/0/0-t 324981 83194664 5.18e1 255.99
ethernet-input 324981 83194664 4.26e1 255.99
geneve6-input 324981 83194664 3.87e1 255.99
ip6-input 324981 83194664 1.22e1 255.99
ip6-lookup 324981 83194664 2.39e1 255.99
ip6-receive 324981 83194664 1.67e2 255.99
ip6-udp-lookup 324981 83194664 1.54e1 255.99
l2-input 324981 83194664 9.28e0 255.99
l2-output 324981 83194664 4.47e0 255.99
```
As with VXLAN over IPv6, GENEVE-v6 is comparatively slow (I say comparatively, because you should not expect anything close to this performance from Linux or BSD kernel routing!). The lower throughput is again due to the costly `ip6-receive` node; GENEVE-v6 comes in slightly behind VXLAN-v6, at 8.32Mpps per core and 368 CPU cycles per packet.
### Crossconnect over IPv4 GENEVE
I now suspect that GENEVE over IPv4 will show gains similar to those I saw when switching VXLAN from IPv6 to IPv4 above. So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it back up to the customer port on both Rhino and Hippo, like so:
```
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```
And the results, indeed a significant improvement:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97
vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 536763 137409904 2.76e0 255.99
HundredGigabitEthernet12/0/0-t 536763 137409904 5.19e1 255.99
ethernet-input 536763 137409904 4.19e1 255.99
geneve4-input 536763 137409904 2.39e1 255.99
ip4-input-no-checksum 536763 137409904 1.18e1 255.99
ip4-lookup 536763 137409904 1.69e1 255.99
ip4-receive 536763 137409904 2.71e1 255.99
ip4-udp-lookup 536763 137409904 1.79e1 255.99
l2-input 536763 137409904 8.81e0 255.99
l2-output 536763 137409904 4.47e0 255.99
```
So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207
CPU cycles per packet.
### Crossconnect over IPv6 GRE
Now I can't help but wonder: if those `ip4|6-udp-lookup` nodes burn valuable CPU cycles, GRE may well do better, because it's an L3 protocol (IP protocol number 47) and VPP never has to inspect beyond the IP header. So I delete the GENEVE tunnel and give GRE a go too:
```
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```
Results:
```
Time 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87
vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 387881 99297464 2.80e0 255.99
HundredGigabitEthernet12/0/0-t 387881 99297464 5.21e1 255.99
ethernet-input 775762 198594928 5.97e1 255.99
gre6-input 387881 99297464 2.81e1 255.99
ip6-input 387881 99297464 1.21e1 255.99
ip6-lookup 387881 99297464 2.39e1 255.99
ip6-receive 387881 99297464 5.09e1 255.99
l2-input 387881 99297464 9.35e0 255.99
l2-output 387881 99297464 4.40e0 255.99
```
The performance of GRE-v6 (in transparent ethernet bridge aka _TEB_ mode) is 9.9Mpps per core or 243 CPU cycles per packet. I'll also note that while the `ip6-receive` node cost around 170 clocks/packet in all the UDP-based tunnels, we're now down to only 51 or so: a huge improvement indeed.
### Crossconnect over IPv4 GRE
To round off the set, I'll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place:
```
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del
create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del
create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```
And without further ado:
```
Time 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61
vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0
Name Calls Vectors Clocks Vectors/Call
HundredGigabitEthernet12/0/0-o 548684 140435080 2.80e0 255.95
HundredGigabitEthernet12/0/0-t 548684 140435080 5.22e1 255.95
ethernet-input 1097368 280870160 2.92e1 255.95
gre4-input 548684 140435080 2.51e1 255.95
ip4-input-no-checksum 548684 140435080 1.19e1 255.95
ip4-lookup 548684 140435080 1.68e1 255.95
ip4-receive 548684 140435080 2.03e1 255.95
l2-input 548684 140435080 8.72e0 255.95
l2-output 548684 140435080 4.43e0 255.95
```
The performance of GRE-v4 (in transparent ethernet bridge aka _TEB_ mode) is 14.0Mpps per
core or 171 CPU cycles per packet. That is the lowest per-packet cost of all the tunneling
protocols, although (for obvious reasons) it will not outperform a direct L2 crossconnect,
which cuts out the L3 (and L4) middleperson entirely. Woohoo!
## Conclusions
First, let me recap the tests I did, ordered from the best performer on the left to the
worst on the right.
Test | L2XC | GRE-v4 | VXLAN-v4 | GENEVE-v4 | GRE-v6 | VXLAN-v6 | GENEVE-v6
------------- | ----------- | --------- | --------- | ---------- | ---------- | --------- | ---------
pps/core | >14.88M | 14.34M | 14.15M | 13.74M | 9.93M | 8.54M | 8.32M
cycles/packet | 132.59 | 171.45 | 201.65 | 207.44 | 243.35 | 355.72 | 368.09
***(!)*** Achtung! Because in the L2XC mode the CPU was not fully consumed (VPP was consuming only
~28 frames per vector), it did not yet achieve its optimum CPU performance. Under full load, the
cycles/packet will be somewhat lower than what is shown here.
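For reference, the 14.88Mpps the loadtester generates is exactly 10GbE line rate with 64-byte frames. A small sketch of the arithmetic: on the wire, each frame also occupies 20 bytes of preamble, start-of-frame delimiter and inter-frame gap:

```python
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Max frames/sec on an Ethernet link: each frame costs its own bytes
    plus 7B preamble, 1B start-of-frame delimiter and 12B inter-frame gap
    (20 bytes of per-frame overhead in total)."""
    return link_bps / ((frame_bytes + 20) * 8)

print(f"{line_rate_pps(10e9, 64) / 1e6:.2f} Mpps")   # 14.88 Mpps at 10GbE
print(f"{line_rate_pps(100e9, 64) / 1e6:.2f} Mpps")  # 148.81 Mpps at 100GbE
```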
Taking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP
node for each type of cross connect; the lower the stack, the faster the cross connect:
{{< image width="1000px" src="/assets/vpp/l2-xconnect-cycles.png" alt="Cycles by node" >}}
Although GRE-v4 is clearly the winner, I still would not use it, for the following reason:
VPP does not support GRE keys, and because GRE is an L3 protocol, I would have to use a unique
IPv4 or IPv6 address pair for each tunnel, otherwise VPP will not know, upon receipt of
a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole
/64 to a loopback and be done with it), but GRE-v6 does not perform as well as VXLAN-v4 or
GENEVE-v4.
VXLAN and GENEVE are equal performers, both in IPv4 and in IPv6. In both cases, IPv4 is
significantly faster than IPv6. But due to the use of _VNI_ fields in the header, contrary
to GRE, both VXLAN and GENEVE can have the same src/dst IP for any number of tunnels, which
is a huge benefit.
### Multithreading
Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be
receiving IPv4 or IPv6 traffic (either tagged or untagged) and this allows the NIC to use _RSS_ to assign
this inbound traffic to multiple queues, and thus multiple CPU threads. That's great, it means linear
encapsulation performance.
Once the traffic is encapsulated, it risks becoming a single flow with respect to the remote host, if
Rhino were sending from 10.0.0.0:4789 to Hippo's 10.0.0.1:4789. However, the VPP VXLAN and GENEVE
implementations both inspect the _inner_ payload and use it to vary the source port (thanks to
Neale for pointing this out, it's in `vxlan/encap.c:246`). Deterministically choosing the source port
based on the inner flow allows Hippo to use _RSS_ on the receiving end, which lets these tunneling
protocols scale linearly. I proved this for myself by attaching a port-mirror to the switch and
copying all traffic between Hippo and Rhino to a spare machine in the rack:
```
pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789
11:19:54.887763 IP 10.0.0.1.4452 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888283 IP 10.0.0.1.42537 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888285 IP 10.0.0.0.17895 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899353 IP 10.0.0.1.40751 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899355 IP 10.0.0.0.35475 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.904642 IP 10.0.0.0.60633 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081
11:22:55.802406 IP 10.0.0.0.32299 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.802409 IP 10.0.0.1.44011 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807711 IP 10.0.0.1.45503 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807712 IP 10.0.0.0.45532 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.841495 IP 10.0.0.0.61694 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.851719 IP 10.0.0.1.47581 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
```
Considering I was sending the TRex profile `bench.py` with tunables `vm=var2,size=64`, the former
of which chooses randomized source and destination (inner) IP addresses in the loadtester, I can
conclude that the outer source port is chosen based on a hash of the inner packet. Slick!
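The mechanism can be sketched in a few lines. This is a toy model, not VPP's actual code (which lives in `vxlan/encap.c`); the CRC32 hash and the port range are illustrative assumptions:

```python
import zlib

def vxlan_src_port(inner_flow: bytes, lo: int = 1024, hi: int = 65535) -> int:
    """Derive the outer UDP source port deterministically from the inner
    frame's flow identifiers: distinct inner flows land on distinct ports
    (and thus distinct RSS queues on the far end), while packets of one
    flow always pick the same port and stay in order."""
    h = zlib.crc32(inner_flow)
    return lo + h % (hi - lo)

flow_a = b"10.1.1.1 10.2.2.2 6 1234 80"   # inner 5-tuple, illustrative
flow_b = b"10.1.1.3 10.2.2.2 6 5678 80"

assert vxlan_src_port(flow_a) == vxlan_src_port(flow_a)  # stable per flow
print(vxlan_src_port(flow_a), vxlan_src_port(flow_b))
```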
### Final conclusion
The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at
IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay -- our backbone is 9000 bytes everywhere,
so it will be possible to provide up to 8942 bytes of customer payload taking into account the
VXLAN-v4 overhead. At least gigabit symmetric _VLLs_ filled with 64b packets will not be a
problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps
per chassis when fully loaded. Even considering the overhead and CPU consumption that VXLAN
encap/decap brings with it, due to the use of multiple transmit and receive threads,
the router would have plenty of room to spare.
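The 8942-byte figure can be reconstructed from the encapsulation headers. A sketch of the arithmetic, where counting the inner Ethernet header plus two customer VLAN tags is my assumption about how the overhead adds up to 58 bytes:

```python
# VXLAN-v4 overhead components on the 9000-byte backbone (the 8 bytes of
# inner QinQ tags are an assumption to arrive at the 8942-byte figure).
overhead = {
    "outer IPv4": 20,
    "outer UDP": 8,
    "VXLAN": 8,
    "inner Ethernet": 14,
    "inner QinQ tags": 8,
}

backbone_mtu = 9000
payload = backbone_mtu - sum(overhead.values())
print(payload)  # 8942
```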
## Appendix
The backing data for the graph in this article are captured in this [Google Sheet](https://docs.google.com/spreadsheets/d/1WZ4xvO1pAjCswpCDC9GfOGIkogS81ZES74_scHQryb0/edit?usp=sharing).
### VPP Configuration
For completeness, the `startup.conf` used on both Rhino and Hippo:
```
unix {
nodaemon
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
cli-prompt rhino#
gid vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
memory {
main-heap-size 1536M
main-heap-page-size default-hugepage
}
cpu {
main-core 0
corelist-workers 1-15
}
buffers {
buffers-per-numa 300000
default data-size 2048
page-size default-hugepage
}
statseg {
size 1G
page-size default-hugepage
per-node-counters off
}
dpdk {
dev default {
num-rx-queues 7
}
decimal-interface-names
dev 0000:0c:00.0
dev 0000:0c:00.1
}
plugins {
plugin lcpng_nl_plugin.so { enable }
plugin lcpng_if_plugin.so { enable }
}
logging {
default-log-level info
default-syslog-log-level crit
class linux-cp/if { rate-limit 10000 level debug syslog-level debug }
class linux-cp/nl { rate-limit 10000 level debug syslog-level debug }
}
lcpng {
default netns dataplane
lcp-sync
lcp-auto-subint
}
```
### Other Details
For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16
slots were used, and that the _Comms_ DDP was loaded:
```
[ 0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000
[ 0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref]
[ 0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref]
[ 0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref]
[ 0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref]
[ 0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs)
[ 0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref]
[ 0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs)
[ 11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0
[ 11.280567] ice 0000:0c:00.0: PTP init successful
[ 11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[ 11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[ 11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware
[ 11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
```
And how the NIC shows up in VPP, in particular the rx/tx burst modes and functions are interesting:
```
hippo# show hardware-interfaces
Name Idx Link Hardware
HundredGigabitEthernet12/0/0 1 up HundredGigabitEthernet12/0/0
Link speed: 100 Gbps
RX Queues:
queue thread mode
0 vpp_wk_0 (1) polling
1 vpp_wk_1 (2) polling
2 vpp_wk_2 (3) polling
3 vpp_wk_3 (4) polling
4 vpp_wk_4 (5) polling
5 vpp_wk_5 (6) polling
6 vpp_wk_6 (7) polling
Ethernet address b4:96:91:b3:b1:10
Intel E810 Family
carrier up full duplex mtu 9190 promisc
flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
Devargs:
rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32)
tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32)
pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0
max rx packet len: 9728
promiscuous: unicast on all-multicast on
vlan offload: strip off filter off qinq off
rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip
outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame
scatter keep-crc rss-hash
rx offload active: ipv4-cksum jumbo-frame scatter
tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free
outer-udp-cksum
tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
rss avail: ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4
ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6
l2-payload
rss active: ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp
ipv6-udp ipv6
tx burst mode: Scalar
tx burst function: ice_recv_scattered_pkts_vec_avx2_offload
rx burst mode: Offload Vector AVX2 Scattered
rx burst function: ice_xmit_pkts
```
Finally, in case it's interesting, an output of [lscpu](/assets/vpp/l2-xconnect-lscpu.txt),
[lspci](/assets/vpp/l2-xconnect-lspci.txt) and [dmidecode](/assets/vpp/l2-xconnect-dmidecode.txt)
as run on Hippo (Rhino is an identical machine).
---
date: "2022-02-14T11:35:14Z"
title: Case Study - VLAN Gymnastics with VPP
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes
can be shared between VPP and the Linux kernel in a clever way, so running software like
[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
>100Mpps and >100Gbps forwarding rates is easily within reach! But after the controlplane is
up and running, VPP has so much more to offer - many interesting L2 and L3 services that you'd expect
in commercial (and very pricey) routers like the Cisco ASR are well within reach.
When Fred and I were in Paris [[report]({{< ref "2021-06-01-paris" >}})], I got stuck trying to
configure an Ethernet over MPLS circuit for IPng from Paris to Zurich. Fred took a look for me
and quickly determined "Ah, you forgot to do the VLAN gymnastics". I found it a fun way to describe
the solution to my problem back then, and come to think of it: the router really can be configured
to hook up anything to pretty much anything -- this post takes a look at similar flexibility in VPP.
## Introduction
When I first started learning how to work with Cisco's Aggregation Services Router platform (Cisco IOS/XR),
I was surprised that there is no concept of a _switch_. Like many network engineers, I was used to being
able to put a number of ports in the same switch VLAN, to take a different set of ports and put them into L3
mode with an IPv4/IPv6 address, or to activate MPLS. And I was used to combining these two concepts by
creating VLAN (L3) interfaces.
Turning to VPP, much like its commercial sibling Cisco IOS/XR, the mental model and approach they take
is different. Each physical interface can have a number of sub-interfaces which carry an encapsulation,
for example a _dot1q_, or a _dot1ad_ or even a double-tagged (QinQ or QinAD). When ethernet frames
arrive on the physical interface, VPP will match them to the sub-interface which is configured to
receive frames of that specific encapsulation, and drop frames that do not match any sub-interface.
### Sub Interfaces in VPP
There are several forms of sub-interface, let's take a look at them:
```
1. create sub <interface> <subId> dot1q|dot1ad <vlanId>
2. create sub <interface> <subId> dot1q|dot1ad <vlanId> exact-match
3. create sub <interface> <subId> dot1q|dot1ad <vlanId> inner-dot1q <vlanId>|any
4. create sub <interface> <subId> dot1q|dot1ad <vlanId> inner-dot1q <vlanId> exact-match
5. create sub <interface> <subId>
6. create sub <interface> <subId>-<subId>
7. create sub <interface> <subId> untagged
8. create sub <interface> <subId> default
```
Alright, that's a lot of choice! Let me go over these one by one.
1. The first variant creates a sub-interface which will match frames with the first VLAN tag being
either _dot1q_ or _dot1ad_ with the given **vlanId**. An important note to this: there might be
more VLAN tags following in the ethernet frame, ie the frame may be _QinQ_ or _QinAD_, and all
of these will be matched.
1. The second variant looks similar, but there the frame will only match if there is
**exactly one** VLAN tag, no more, no less. So this sub-interface will not match frames which are
_QinQ_ or _QinAD_.
1. The third variant creates a sub-interface which matches an outer _dot1q_ or _dot1ad_ VLAN and in
addition an inner _dot1q_ tag. The special keyword **any** can be specified, which will make the
sub-interface match _QinQ_ or _QinAD_ frames without caring which inner tag is used.
1. The fourth variant looks a bit like the second one, in that it will match for frames which have
**exactly two** VLAN tags (either _dot1q_._dot1q_ or _dot1ad_._dot1q_). In this _exact-match_ mode of
operation, precisely those two tags must be present, and no other tags may follow.
1. The fifth variant is simply a shorthand for the second one, it creates an exact-match dot1q with
a **vlanId** equal to the given **subId**. This is the most obvious form, and people will recognize
this as "just" a VLAN :)
1. The sixth variant further expands on this pattern, and creates a list of these dot1q exact-match
(eg. 100-200 will create 101 sub-interfaces).
1. The seventh variant creates a sub-interface that matches any frames that have **exactly zero**
tags (ie. _untagged_), and finally
1. The eighth variant matches anything that is not matched by any other sub-interface (ie. the
fallthrough _default_).
When I first saw this, it seemed overly complicated to me, but now that I've gotten to know this
way of thinking, what's being presented here is a way for any physical interface to branch off
inbound traffic based on exactly zero tags (_untagged_), exactly one (_dot1q_ or _dot1ad_ with
_exact-match_), exactly two (outer _dot1q_ or _dot1ad_ followed by _inner-dot1q_ with _exact-match_),
or one outer tag followed by _any_ inner tag(s). In other words, any combination of zero, one or
two tags present on the frame can be matched and acted on by this logic.
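The matching rules above can be modelled in a few lines. This is a toy classifier of my own making, not VPP code; each sub-interface is represented as an `(outer, inner, exact)` tuple, with `outer=None` meaning _untagged_ and `inner="any"` meaning _inner-dot1q any_:

```python
def classify(tags, subifs):
    """Toy model of VPP sub-interface matching. `tags` is the list of VLAN
    IDs on the frame (outermost first); each sub-interface is a tuple
    (outer, inner, exact). More specific matches win over looser ones."""
    def matches(sub):
        outer, inner, exact = sub
        if outer is None:                      # the 'untagged' variant
            return len(tags) == 0
        if not tags or tags[0] != outer:
            return False
        if inner is None:                      # single-tag variants
            return len(tags) == 1 if exact else True
        if inner == "any":                     # outer + any inner tag(s)
            return len(tags) >= 2
        if len(tags) < 2 or tags[1] != inner:  # outer + specific inner
            return False
        return len(tags) == 2 if exact else True

    # Try double-tag matches before loose single-tag ones.
    for sub in sorted(subifs, key=lambda s: (s[1] is None, not s[2])):
        if matches(sub):
            return sub
    return "default"

subifs = [(100, None, False), (100, 200, True), (None, None, True)]
print(classify([100, 200], subifs))  # (100, 200, True): the QinQ exact match
print(classify([100, 300], subifs))  # (100, None, False): loose dot1q 100
print(classify([], subifs))          # (None, None, True): 'untagged'
```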
A few other considerations:
* If a sub-interface is created with a given _dot1q_ or _dot1ad_ tag, you can't have another sub-interface
with a different matching logic on that same tag; for example, creating `dot1q 100` means you can't
then also create `dot1q 100 exact-match`. If that behavior is desired, you'll want to create
`dot1q 100 inner-dot1q any` followed by `dot1q 100 exact-match`
* For L3 interfaces, it only makes sense to have _exact-match_ interfaces. I found a bug
in VPP that leads to a crash, which I've fixed in [[this gerrit](https://gerrit.fd.io/r/c/vpp/+/33444)],
so now the API and CLI throw an error instead of taking down the router.
### Bridge Domains
So how do we make the functional equivalent of a VLAN, where several interfaces are bound together into
an L2 broadcast domain, like a regular switch might do? The VPP answer to this is a _bridge-domain_
which I can create and give a number, and then add any interface to it, like so:
```
vpp# create bridge-domain 10
vpp# set interface l2 bridge GigabitEthernet10/0/0 10
vpp# set interface l2 bridge BondEthernet0 10
```
And if I want to add an IP address (creating the equivalent of a routable _VLAN Interface_), I create what
is called a Bridge Virtual Interface or _BVI_, add that interface to the bridge domain, and optionally
expose it in Linux with the [LinuxCP]({{< ref "2021-09-21-vpp-7" >}}) plugin:
```
vpp# bvi create instance 10 mac 02:fe:4b:4c:22:8f
vpp# set interface l2 bridge bvi10 10 bvi
vpp# set interface ip address bvi10 192.0.2.1/24
vpp# set interface ip address bvi10 2001:db8::1/64
vpp# lcp create bvi10 host-if bvi10
```
A bridge-domain is fully configurable - by default it'll participate in L2 learning, maintain a FIB (which
MAC addresses are seen behind which interface), and pass along ARP requests and Neighbor Discovery. But I can
configure it to turn on/off forwarding, ARP, handling of unknown unicast frames, and so on, the complete
list of functionality that can be changed at runtime:
```
set bridge-domain arp entry <bridge-domain-id> [<ip-addr> <mac-addr> [del] | del-all]
set bridge-domain arp term <bridge-domain-id> [disable]
set bridge-domain arp-ufwd <bridge-domain-id> [disable]
set bridge-domain default-learn-limit <maxentries>
set bridge-domain flood <bridge-domain-id> [disable]
set bridge-domain forward <bridge-domain-id> [disable]
set bridge-domain learn <bridge-domain-id> [disable]
set bridge-domain learn-limit <bridge-domain-id> <learn-limit>
set bridge-domain mac-age <bridge-domain-id> <mins>
set bridge-domain rewrite <bridge-domain> [disable]
set bridge-domain uu-flood <bridge-domain-id> [disable]
```
This makes bridge domains a very powerful concept: a strict superset of what I might be able to
configure on an L2 switch.
### L2 CrossConnect
I thought it'd be useful to point out another powerful concept, which made an appearance in my previous
post about [Virtual Leased Lines]({{< ref "2022-01-12-vpp-l2" >}}). If all I want to do is connect two
interfaces together, there won't be a need for learning, L2 FIB, and so on. It is computationally much
simpler to just take any frame received on interface A and transmit it out on interface B, unmodified.
This is known in VPP as a layer2 crossconnect, and can be configured like so:
```
vpp# set interface l2 xconnect GigabitEthernet10/0/0 GigabitEthernet10/0/3
vpp# set interface l2 xconnect GigabitEthernet10/0/3 GigabitEthernet10/0/0
```
I should point out that this has to be done in both directions. The first invocation will transmit any
frame received on Gi10/0/0 directly out on Gi10/0/3, and the second one will transmit any frame from Gi10/0/3
directly out on Gi10/0/0, turning this into a very efficient way to connect two interfaces together.
Obviously, this only works in pairs, if more interfaces have to be connected, the bridge-domain is
the way to go. That said, L2 cross connects are super common.
### Tag Rewriting
If I want to connect two tagged _sub_-interfaces together, for example Gi10/0/0.123 to Gi10/0/3.321,
things get a bit more complicated. When VPP receives the frame from the first interface, it'll arrive tagged
with VLAN 123, so what happens if that is l2 crossconnected to Gi10/0/3.321? The answer will surprise you,
so let's take a look:
```
vpp# set interface state GigabitEthernet10/0/0 up
vpp# set interface state GigabitEthernet10/0/3 up
vpp# create sub GigabitEthernet10/0/0 123
vpp# set interface state GigabitEthernet10/0/0.123 up
vpp# create sub GigabitEthernet10/0/3 321
vpp# set interface state GigabitEthernet10/0/3.321 up
vpp# set interface l2 xconnect GigabitEthernet10/0/0.123 GigabitEthernet10/0/3.321
vpp# set interface l2 xconnect GigabitEthernet10/0/3.321 GigabitEthernet10/0/0.123
```
If I send a packet into Gi10/0/0.123, the L2 crossconnect will copy the entire frame, unmodified
into Gi10/0/3.321, but how can that be? That interface Gi10/0/3.321 is tagged with VLAN 321! VPP will
end up sending the frame out on interface Gi10/0/3 tagged as **VLAN 123**. In the other direction, frames
received on Gi10/0/3.321 will be sent out tagged as **VLAN 321** on Gi10/0/0. This is certainly not
what I expected.
To address this, VPP can add or remove VLAN tags when it receives a frame, when it transmits a frame,
or both, let me show you this concept up close, as it's really powerful!
VLAN tag rewrite provides the ability to change the VLAN tags on a packet. Existing tags can be popped,
new tags can be pushed, and existing tags can be swapped with new tags. The rewrite feature is attached
to a sub-interface as input and output operations. The input operation is explicitly configured by CLI or
API calls, and the output operation is the symmetric opposite and is automatically derived from the input
operation.
* **POP**: For pop operations, the sub-interface encapsulation (the vlan tags specified when it was created)
must have at least the number of popped tags. e.g. the "pop 2" operation would be rejected on a
single-vlan interface. The output tag-rewrite operation will push the specified number of vlan
tags onto the packet before transmitting. The pushed tag values are taken from the sub-interface encapsulation
configuration.
* **PUSH**: For push operations, the ethertype (_dot1q_ or _dot1ad_) is also specified. The output tag-rewrite
operation for pushes is to pop the same number of tags off the packet. If the packet doesn't have enough tags
it is dropped.
* **TRANSLATE**: This is a combination of a pop and a push operation.
This may be confusing at first, so let me demonstrate how this works, by extending the example above. On
the machine connected to Gi10/0/0.123, I'll configure an IP address and try to ping its neighbor:
```
pim@hippo:~$ sudo ip link add link enp4s0f0 name vlan123 type vlan id 123
pim@hippo:~$ sudo ip link set vlan123 up
pim@hippo:~$ sudo ip addr add 192.0.2.1/30 dev vlan123
pim@hippo:~$ ping 192.0.2.2
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
...
```
On the other side, I'll tcpdump what comes out the Gi10/0/3 port (which, as I observed above, is not carrying
the tag, 321, but instead carrying the original ingress tag, 123):
```
16:33:59.489246 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46:
ethertype 802.1Q (0x8100), vlan 123, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
```
Now, to demonstrate tag rewriting, I will remove (pop) the ingress VLAN tag from Gi10/0/0.123 when
a packet is received:
```
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1
16:37:42.721424 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 42:
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
```
There is no tag at all. What happened here is that when Gi10/0/0.123 received the frame, the 'pop' operation
stripped 1 VLAN tag off the frame. And as we'll see later, when that sub-interface transmits a frame, the
'pop' operation will add one VLAN tag (123) to the front of the frame.
Remember how I pointed out above that the 'pop' operation is symmetric? I can put that to use: if I were to
_also_ apply this on the Gi10/0/3.321 interface, then it will push the tag (of Gi10/0/3.321) onto the packet
before sending it, and of course the other way around as well:
```
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 pop 1
16:41:00.352840 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46:
ethertype 802.1Q (0x8100), vlan 321, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
16:41:00.352867 fe:54:00:00:10:03 > fe:54:00:00:10:00, length 46:
ethertype 802.1Q (0x8100), vlan 321, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28
```
Hey look, there's our ARP reply packet! That packet, coming back into Gi10/0/3.321 and hitting the
tag-rewrite, will in turn have its tag removed; and the 'pop' being symmetrical, a new tag 123 is of
course pushed on egress of Gi10/0/0.123, so I can now see connectivity end to end. Neat!
Other operations that are interesting, include arbitrarily adding a _dot1q_ tag (or even two tags):
```
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100
16:45:33.121049 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 50:
ethertype 802.1Q (0x8100), vlan 100, p 0,
ethertype 802.1Q (0x8100), vlan 123, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100 200
16:48:15.936807 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 54:
ethertype 802.1Q (0x8100), vlan 100, p 0,
ethertype 802.1Q (0x8100), vlan 200, p 0,
ethertype 802.1Q (0x8100), vlan 123, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
```
And finally, swapping (translating) VLAN tags:
```
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1ad 100
16:50:56.705015 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46:
ethertype 802.1Q-QinQ (0x88a8), vlan 100, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1q 321
vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 translate 1-1 dot1q 123
16:44:03.462842 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46:
ethertype 802.1Q (0x8100), vlan 321, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Request who-has 192.0.2.2 tell 192.0.2.1, length 28
16:44:03.462847 fe:54:00:00:10:03 > fe:54:00:00:10:00, length 46:
ethertype 802.1Q (0x8100), vlan 321, p 0,
ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4),
Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28
```
This last set of 'translate 1-1' operations has a similar effect to the 'pop 1': the VLAN is rewritten to 321
when receiving from Gi10/0/0.123, and rewritten to 123 when receiving from Gi10/0/3.321, making
end-to-end traffic possible again.
#### Final conclusion
The four concepts discussed here can be combined in countless interesting ways:
* Create sub-interface with or without exact-match, to handle certain encapsulated packets
* Provide layer2 crossconnect functionality between any two interfaces or sub-interfaces
* Add multiple interfaces and sub-interfaces into a bridge-domain
* Ensure that VLAN tags are popped and pushed consistently on tagged sub-interfaces
The practical conclusion is that VPP can provide fully transparent, dot1q and jumboframe enabled
virtual leased lines (see my previous post on [VLL performance]({%post_url 2022-01-12-vpp-l2 %})),
including using regular breakout switches to greatly increase the total port count for customers.
I'll leave you with a working example of an L2VPN between a breakout switch behind **nlams0.ipng.ch**
in Amsterdam and a remote VPP router in Zurich called **ddln0.ipng.ch**. Take the following
[S5860-20SQ]({%post_url 2021-08-07-fs-switch %}) switch, which connects to the VPP router on Te0/1
and a customer on Te0/2:
```
fsw0(config)#vlan 3438
fsw0(config-vlan)#name v-vll-customer
fsw0(config-vlan)#exit
fsw0(config)#interface TenGigabitEthernet 0/1
fsw0(config-if-TenGigabitEthernet 0/1)#description Core: nlams0.ipng.ch Te6/0/0
fsw0(config-if-TenGigabitEthernet 0/1)#mtu 9216
fsw0(config-if-TenGigabitEthernet 0/1)#switchport mode trunk
fsw0(config-if-TenGigabitEthernet 0/1)#switchport trunk allowed vlan add 3438
fsw0(config)#interface TenGigabitEthernet 0/2
fsw0(config-if-TenGigabitEthernet 0/2)#description Cust: Customer VLL Port NIKHEF
fsw0(config-if-TenGigabitEthernet 0/2)#mtu 1522
fsw0(config-if-TenGigabitEthernet 0/2)#switchport mode dot1q-tunnel
fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel native vlan 3438
fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add untagged 3438
fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
```
I configure the first port here to be a VLAN trunk port to the router, and add VLAN 3438 to it. Then,
I configure the second port to be a customer _dot1q-tunnel_ port, which accepts untagged frames and
puts them in VLAN 3438, and additionally accepts tagged frames in VLAN 1000-2000 and prepends the
customer VLAN 3438 to them - so these will become QinQ double tagged 3438.1000-2000.
The corresponding snippet of the VPP router configuration looks like this:
```
comment { Customer VLL to DDLN }
lcp lcp-auto-sub-int off
create sub TenGigabitEthernet6/0/0 3438 dot1q 3438
set interface mtu packet 1518 TenGigabitEthernet6/0/0.3438
set interface state TenGigabitEthernet6/0/0.3438 up
set interface l2 tag-rewrite TenGigabitEthernet6/0/0.3438 pop 1
create vxlan tunnel instance 12 src 194.1.163.32 dst 194.1.163.5 vni 320501 decap-next l2
set interface state vxlan_tunnel12 up
set interface mtu packet 1518 vxlan_tunnel12
set interface l2 xconnect TenGigabitEthernet6/0/0.3438 vxlan_tunnel12
set interface l2 xconnect vxlan_tunnel12 TenGigabitEthernet6/0/0.3438
lcp lcp-auto-sub-int on
```
The customer facing interfaces have an MTU of 1518 bytes, which is enough for the 1500 bytes of the
IP packet, including 14 bytes of L2 overhead (src-mac, dst-mac, ethertype), and one optional VLAN tag.
In other words, this VLL is _dot1q_ capable, because the VPP sub-interface Te6/0/0.3438 did not specify
_exact-match_, so it'll accept any additional VLAN tags. Of course this does require the path from
**nlams0.ipng.ch** to **ddln0.ipng.ch** to be (baby)jumbo enabled, which they are as AS8298 is fully
9000 byte capable.
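For the curious, the arithmetic behind that 1518 byte figure can be sketched in a couple of lines (a sanity check only; the header sizes are standard Ethernet, and the 4-byte FCS is not counted towards the MTU):

```python
# MTU arithmetic for a dot1q-capable VLL port: 1500 bytes of IP payload,
# plus the L2 header, plus room for one optional 802.1Q tag.
IP_MTU = 1500            # what the customer's IP stack sees
L2_HEADER = 6 + 6 + 2    # dst-mac + src-mac + ethertype
DOT1Q_TAG = 4            # one optional VLAN tag

frame_mtu = IP_MTU + L2_HEADER + DOT1Q_TAG
print(frame_mtu)  # 1518
```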
---
date: "2022-02-21T09:35:14Z"
title: 'Review: Cisco ASR9006/RSP440-SE'
---
## Introduction
{{< image width="180px" float="right" src="/assets/asr9006/ipmax.png" alt="IP-Max" >}}
If you've read up on my articles, you'll know that I have deployed a [European Ring]({%post_url 2021-02-27-network %}),
which was reformatted late last year into [AS8298]({%post_url 2021-10-24-as8298 %}) and upgraded to run
[VPP Routers]({%post_url 2021-09-21-vpp-7 %}) with 10G between each city. IPng Networks rents these 10G point to point
virtual leased lines between each of our locations. It's a really great network, and it performs so well because it's
built on an EoMPLS underlay provided by [IP-Max](https://ip-max.net/). They, in turn, run carrier grade hardware in the
form of Cisco ASR9k. In part, we're such a good match together, because my choice of [VPP](https://fd.io/) on the IPng
Networks routers fits very well with Fred's choice of [IOS/XR](https://en.wikipedia.org/wiki/Cisco_IOS_XR) on the
IP-Max routers.
And if you follow us on Twitter (I post as [@IPngNetworks](https://twitter.com/IPngNetworks/)), you may have seen a
recent post where I upgraded an aging [ASR9006](https://www.cisco.com/c/en/us/support/routers/asr-9006-router/model.html)
with a significantly larger [ASR9010](https://www.cisco.com/c/en/us/support/routers/asr-9010-router/model.html). The ASR9006
was initially deployed at Equinix Zurich ZH05 in Oberenstringen near Zurich, Switzerland in 2015, which is seven years ago.
It has hauled countless packets from Zurich to Paris, Frankfurt and Lausanne. When it was deployed, it came with a
A9K-RSP-4G route switch processor, which in 2019 was upgraded to the A9K-RSP-8G, and after so many hours^W years of
runtime needed a replacement. Also, IP-Max was starting to run out of ports for the chassis, hence the upgrade.
{{< image width="300px" float="left" src="/assets/asr9006/staging.png" alt="IP-Max" >}}
If you're interested in the line-up, there's this epic reference guide from [Cisco Live!](https://www.cisco.com/c/en/us/td/docs/iosxr/asr9000/hardware-install/overview-reference/b-asr9k-overview-reference-guide/b-asr9k-overview-reference-guide_chapter_010.html#con_733653)
that shows a deep dive of the ASR9k architecture. The chassis and power supplies can host several generations of silicon,
and even mix-and-match generations. So IP-Max ordered a few new RSPs, and after deploying the ASR9010 at ZH05, we made
plans to redeploy this ASR9006 at NTT Zurich in R&uuml;mlang next to the airport, to replace an even older Cisco 7600
at that location. Seeing as we have to order XFP optics (IP-Max has some DWDM/CWDM links in service at NTT), we have
to park the chassis in and around Zurich. What better place to park it, than in my lab ? :-)
The IPng Networks laboratory is where I do most of my work on [VPP](https://fd.io/). The rack you see to the left here holds my coveted
Rhino and Hippo (two beefy AMD Ryzen 5950X machines with 100G network cards), and a few Dells that comprise my VPP
lab. There was not enough room, so I gave this little fridge a place just adjacent to the rack, connected with 10x 10Gbps
and serial and management ports.
I immediately had a little giggle when booting up the machine. It comes with 4x 3kW power supply slots (3 are
installed), and I was happy that there was no debris lying on the side or back of the router, as its fans create
a veritable vortex of airflow. Also, overnight the temperature in my basement lab + office room rose a few degrees.
It's now nice and toasty in my office, no need for the heater in the winter. Yet the machine stays quite cool at
26C intake, consuming 2.2kW _idle_: the two route processors (RSP440) draw 240W each, the three 8x TenGigE blades
draw 575W each, and the 40x GigE blade draws a respectable 320 Watts.
```
RP/0/RSP0/CPU0:fridge(admin)#show environment power-supply
R/S/I Power Supply Voltage Current
(W) (V) (A)
0/PS0/M1/* 741.1 54.9 13.5
0/PS0/M2/* 712.4 54.8 13.0
0/PS0/M3/* 765.8 55.1 13.9
--------------
Total: 2219.3
```
For reference, Rhino and Hippo draw approximately 265W each, but they come with 4x1G, 4x10G, 2x100G and forward ~300Mpps
when fully loaded. By the end of this article, I hope you'll see why this is a funny juxtaposition to me.
### Installing the ASR9006
The Cisco RSPs came to me new-in-refurbished-box. When booting, I had no idea what username/password was used for the
preinstall, and none of the standard passwords worked. So the first order of business is to take ownership of the
machine. I do this by putting both RSPs in _rommon_ (which is done by sending _Break_ after powercycling the machine --
my choice of _tio(1)_ has ***Ctrl-t b*** as the magic incantation). The first RSP (in slot 0) is then set to a different
`confreg 0x142`, while the other is kept in rommon so it doesn't boot and take over the machine. After booting, I'm
then presented with a root user setup dialog. I create a user `pim` with a temporary password, set back the
configuration register, and reload. While the first RSP comes back up, I release the standby RSP to catch up, and
voila: I'm _In like Flynn._
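For reference, the dance looks roughly like this on the console (a sketch from memory, so treat the prompts and exact commands as approximate; `0x142` boots while ignoring the startup configuration, `0x102` is the normal setting):

```
rommon 1 > confreg 0x142
rommon 2 > reset
  ... boots, presents the root-system username creation dialog ...
RP/0/RSP0/CPU0:ios(admin-config)# config-register 0x102
RP/0/RSP0/CPU0:ios# reload
```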
Wiring this up - I connect Te0/0/0/0 to IPng's office switch on port sfp-sfpplus9, and I assign the router an IPv4 and IPv6
address. Then, I connect four Tengig ports to the lab switch, so that I can play around with loadtests a little bit.
After turning on LLDP, I can see the following physical view:
```
RP/0/RSP0/CPU0:fridge#show lldp neighbors
Sun Feb 20 19:14:21.775 UTC
Capability codes:
(R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
(W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other
Device ID Local Intf Hold-time Capability Port ID
xsw1-btl Te0/0/0/0 120 B,R bridge/sfp-sfpplus9
fsw0 Te0/1/0/0 41 P,B,R TenGigabitEthernet 0/9
fsw0 Te0/1/0/1 41 P,B,R TenGigabitEthernet 0/10
fsw0 Te0/2/0/0 41 P,B,R TenGigabitEthernet 0/7
fsw0 Te0/2/0/1 41 P,B,R TenGigabitEthernet 0/8
Total entries displayed: 5
```
First, I decide to hook up basic connectivity behind port Te0/0/0/0. I establish OSPF, OSPFv3 and this gives me
visibility to the route-reflectors at IPng's AS8298. Next, I also establish three IPv4 and IPv6 iBGP sessions, so
the machine enters the Default Free Zone (also, _daaayum_, that table keeps on growing at 903K IPv4 prefixes and
143K IPv6 prefixes).
```
RP/0/RSP0/CPU0:fridge#show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface
194.1.163.3 1 2WAY/DROTHER 00:00:35 194.1.163.66 TenGigE0/0/0/0.101
Neighbor is up for 00:11:14
194.1.163.4 1 FULL/BDR 00:00:38 194.1.163.67 TenGigE0/0/0/0.101
Neighbor is up for 00:11:11
194.1.163.87 1 FULL/DR 00:00:37 194.1.163.87 TenGigE0/0/0/0.101
Neighbor is up for 00:11:12
RP/0/RSP0/CPU0:fridge#show ospfv3 neighbor
Neighbor ID Pri State Dead Time Interface ID Interface
194.1.163.87 1 FULL/DR 00:00:35 2 TenGigE0/0/0/0.101
Neighbor is up for 00:12:14
194.1.163.3 1 2WAY/DROTHER 00:00:33 16 TenGigE0/0/0/0.101
Neighbor is up for 00:12:16
194.1.163.4 1 FULL/BDR 00:00:36 20 TenGigE0/0/0/0.101
Neighbor is up for 00:12:12
RP/0/RSP0/CPU0:fridge#show bgp ipv4 uni sum
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 915517 915517 915517 915517 915517 915517
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
194.1.163.87 0 8298 172514 9 915517 0 0 00:04:47 903406
194.1.163.140 0 8298 171853 9 915517 0 0 00:04:56 903406
194.1.163.148 0 8298 176244 9 915517 0 0 00:04:49 903406
RP/0/RSP0/CPU0:fridge#show bgp ipv6 uni sum
Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer
Speaker 151597 151597 151597 151597 151597 151597
Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd
2001:678:d78:3::87
0 8298 54763 10 151597 0 0 00:05:19 142542
2001:678:d78:6::140
0 8298 51350 10 151597 0 0 00:05:23 142542
2001:678:d78:7::148
0 8298 54572 10 151597 0 0 00:05:25 142542
```
One of the acceptance tests of new hardware at AS25091 IP-Max is to ensure that it takes a full table,
to help ensure memory is present, accounted for, and working. These route switch processor boards come
with 12GB of ECC memory, and can accommodate routing table growth for a while to come. If/when they are
at the end of their useful life, they will be replaced with A9K-RSP-880's, which will also give us
access to 40G, 100G and 24x10G SFP+ line cards. At that point, the upgrade path is much easier, as
the chassis will already be installed: it's a matter of popping in new RSPs and replacing the line
cards one by one.
## Loadtesting the ASR9006/RSP440-SE
Now that this router has some basic connectivity, I'll do something that I always wanted to do: loadtest
an ASR9k! I have mad amounts of respect for Cisco's ASR9k series, but as we'll soon see, their stability
is their most redeeming quality, not their performance. Nowadays, many flashy 100G machines are around,
which do indeed have the performance, but not the stability! I've seen routers with an uptime of 7 years,
and BGP sessions and OSPF adjacencies with an uptime of 5+ years. It's just that I've not seen that type
of stability beyond Cisco and maybe Juniper. So if you want _Rock Solid Internet_, this is definitely
the way to go.
I have written a word or two on how VPP (an open source dataplane very similar to these industrial machines)
works. A great example is my recent [VPP VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) article.
There's a lot I can learn from comparing the performance between VPP and Cisco ASR9k, so I will focus
on the following set of practical questions:
1. See if unidirectional versus bidirectional traffic impacts performance.
1. See if there is a performance penalty of using _Bundle-Ether_ (LACP controlled link aggregation).
1. Of course, replay my standard issue 1514b large packets, internet mix (_imix_) packets, small 64b packets
from random source/destination addresses (ie. multiple flows); and finally the killer test of small 64b
packets from a static source/destination address (ie. single flow).
This is in total 2 (uni/bi) x2 (lag/plain) x4 (packet mix) or 16 loadtest runs, for three forwarding types ...
1. See performance of L2VPN (Point-to-Point), similar to what VPP would call "l2 xconnect". I'll create an
L2 crossconnect between port Te0/1/0/0 and Te0/2/0/0; this is the simplest form computationally: it
forwards any frame received on the first interface directly out on the second interface.
1. Take a look at performance of L2VPN (Bridge Domain), what VPP would call "bridge-domain". I'll create a
Bridge Domain between port Te0/1/0/0 and Te0/2/0/0; this includes layer2 learning and FIB, and can tie
together any number of interfaces into a layer2 broadcast domain.
1. And of course, tablestakes, see performance of IPv4 forwarding, with Te0/1/0/0 as 100.64.0.1/30 and
Te0/2/0/0 as 100.64.1.1/30 and setting a static for 48.0.0.0/8 and 16.0.0.0/8 back to the loadtester.
... making a grand total of 48 loadtests. I have my work cut out for me! So I boot up Rhino, which has a
Mellanox ConnectX5-Ex (PCIe v4.0 x16) network card sporting two 100G interfaces, and it can easily keep up
with this 2x10G single interface, and 2x20G LAG, even with 64 byte packets. I am continually amazed that
a full line rate loadtest of small 64 byte packets at a rate of 40Gbps boils down to 59.52Mpps!
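That 59.52 Mpps number follows directly from the 20 bytes of per-packet L1 overhead on the wire (7 byte preamble, 1 byte start-of-frame delimiter and a 12 byte inter-frame gap); a quick sketch:

```python
# L1 line rate in packets per second for a given link speed and frame size.
# Each frame occupies frame_bytes + 20 bytes on the wire: 7B preamble,
# 1B start-of-frame delimiter and a 12B inter-frame gap (the 4B FCS is
# already part of the 64 byte minimum frame).
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    return link_bps / ((frame_bytes + 20) * 8)

print(line_rate_pps(10e9, 64) / 1e6)  # ~14.88 Mpps per 10G port
print(line_rate_pps(40e9, 64) / 1e6)  # ~59.52 Mpps for 4x 10G
```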
For each loadtest, I ramp up the traffic using a [T-Rex loadtester]({%post_url 2021-02-27-coloclue-loadtest %})
that I wrote. It starts with a low-pps warmup duration of 30s, then it ramps up from 0% to a certain line rate
(in this case, alternating to 10GbpsL1 for the single TenGig tests, or 20GbpsL1 for the LACP tests), with a
rampup duration of 120s and finally it holds for duration of 30s.
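The profile can be sketched as a simple piecewise function (illustrative only, not the actual T-Rex script; the warmup rate is an assumption, as the description above only says "low-pps"):

```python
def offered_rate(t: float, target_gbps: float,
                 warmup: float = 30, rampup: float = 120, hold: float = 30) -> float:
    """Offered L1 rate in Gbps at time t, for the warmup/rampup/hold profile."""
    if t < warmup:
        return 0.1 * target_gbps           # low-pps warmup (10% is an assumption)
    t -= warmup
    if t < rampup:
        return target_gbps * t / rampup    # linear ramp from 0% to 100%
    t -= rampup
    return target_gbps if t <= hold else 0.0

# Halfway through the ramp of a 20Gbps (LACP) run:
print(offered_rate(30 + 60, 20.0))  # 10.0
```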
The following sections describe the methodology and the configuration statements on the ASR9k, with a quick
table of results per test, and a longer set of thoughts all the way at the bottom of this document. So I
encourage you not to skip ahead; instead, read on and learn a bit (as I did!) from the configuration itself.
**The question to answer**: Can this beasty mini-fridge sustain line rate? Let's go take a look!
## Test 1 - 2x 10G
In this test, I configure a very simple physical environment (this is a good time to take another look at the LLDP table
above). The Cisco is connected with 4x 10G to the switch, Rhino and Hippo are connected with 2x 100G to the switch
and I have a Dell connected as well with 2x 10G to the switch (this can be very useful to take a look at what's going
on on the wire). The switch is an FS S5860-48SC (with 48x10G SFP+ ports, and 8x100G QSFP ports), which is a piece of
kit that I highly recommend by the way.
Its configuration:
```
interface TenGigabitEthernet 0/1
description Infra: Dell R720xd hvn0:enp5s0f0
no switchport
mtu 9216
!
interface TenGigabitEthernet 0/2
description Infra: Dell R720xd hvn0:enp5s0f1
no switchport
mtu 9216
!
interface TenGigabitEthernet 0/7
description Cust: Fridge Te0/2/0/0
mtu 9216
switchport access vlan 20
!
interface TenGigabitEthernet 0/9
description Cust: Fridge Te0/1/0/0
mtu 9216
switchport access vlan 10
!
interface HundredGigabitEthernet 0/53
description Cust: Rhino HundredGigabitEthernet15/0/1
mtu 9216
switchport access vlan 10
!
interface HundredGigabitEthernet 0/54
description Cust: Rhino HundredGigabitEthernet15/0/0
mtu 9216
switchport access vlan 20
!
monitor session 1 destination interface TenGigabitEthernet 0/1
monitor session 1 source vlan 10 rx
monitor session 2 destination interface TenGigabitEthernet 0/2
monitor session 2 source vlan 20 rx
```
What this does is connect Rhino's Hu15/0/1 and Fridge's Te0/1/0/0 in VLAN 10, and sends a readonly copy of all
traffic to the Dell's enp5s0f0 interface. Similarly, Rhino's Hu15/0/0 and Fridge's Te0/2/0/0 in VLAN 20 with a copy
of traffic to the Dell's enp5s0f1 interface. I can now run `tcpdump` on the Dell to see what's going back and forth.
In case you're curious: the monitor ports Te0/1 and Te0/2 will saturate if both monitored machines transmit at
a combined rate of over 10Gbps. If that happens, the traffic that doesn't fit is simply dropped from the monitor
port, but it is of course still forwarded correctly between the original Hu0/53 and Te0/9 ports. In other words: the
monitor session has no performance penalty. It's merely a convenience to be able to take a look on ports where `tcpdump` is
not easily available (ie. both VPP as well as the ASR9k in this case!)
### Test 1.1: 10G L2 Cross Connect
A simple matter of virtually patching one interface into the other, I choose the first port on blade 1 and 2, and
tie them together in a `p2p` cross connect. In my [VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) post, I
called this a `l2 xconnect`, and although the configuration statements are a bit different, the purpose and expected
semantics are identical:
```
interface TenGigE0/1/0/0
l2transport
!
!
interface TenGigE0/2/0/0
l2transport
!
!
l2vpn
xconnect group loadtest
p2p xc01
interface TenGigE0/1/0/0
interface TenGigE0/2/0/0
!
!
```
The results of this loadtest look promising - although I can already see that the port will not sustain
line rate at 64 byte packets, which I find somewhat surprising. Both when using multiple flows (ie. random
source and destination IP addresses), as well as when using a single flow (repeating the same src/dst packet),
the machine tops out at around 20 Mpps which is 68% of line rate (29.76 Mpps). Fascinating!
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 810 kpps | 9.94 Gbps | 1.61 Mpps | 19.77 Gbps
imix | 3.25 Mpps | 9.94 Gbps | 6.46 Mpps | 19.78 Gbps
64b Multi | 14.66 Mpps | 9.86 Gbps | 20.3 Mpps | 13.64 Gbps
64b Single | 14.28 Mpps | 9.60 Gbps | 20.3 Mpps | 13.62 Gbps
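As a sanity check on the 68% figure: the bidirectional 64 byte line rate on 2x 10G is 2x 14.88 = 29.76 Mpps, assuming the usual 20 bytes of per-frame L1 overhead:

```python
# Fraction of 64 byte line rate achieved in the bidirectional test above.
line_rate_mpps = 2 * 10e9 / ((64 + 20) * 8) / 1e6  # 2x 10G = ~29.76 Mpps
achieved_mpps = 20.3                               # from the table above
print(round(achieved_mpps / line_rate_mpps * 100))  # 68 (%)
```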
### Test 1.2: 10G L2 Bridge Domain
I then keep the two physical interfaces in `l2transport` mode, but change the type of l2vpn into a
`bridge-domain`, which I described in my [VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) post
as well. VPP and Cisco IOS/XR semantics look very similar indeed, they differ really only in the way
in which the configuration is expressed:
```
interface TenGigE0/1/0/0
l2transport
!
!
interface TenGigE0/2/0/0
l2transport
!
!
l2vpn
xconnect group loadtest
!
bridge group loadtest
bridge-domain bd01
interface TenGigE0/1/0/0
!
interface TenGigE0/2/0/0
!
!
!
!
```
Here, I find that performance in one direction is line rate, and with 64b packets ever so slightly better
than the L2 crossconnect test above. In both directions though, the router struggles to obtain line rate
in small packets, delivering 64% (or 19.0 Mpps) of the total offered 29.76 Mpps back to the loadtester.
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 807 kpps | 9.91 Gbps | 1.63 Mpps | 19.96 Gbps
imix | 3.24 Mpps | 9.92 Gbps | 6.47 Mpps | 19.81 Gbps
64b Multi | 14.82 Mpps | 9.96 Gbps | 19.0 Mpps | 12.79 Gbps
64b Single | 14.86 Mpps | 9.98 Gbps | 19.0 Mpps | 12.81 Gbps
I would say that in practice, the performance of a bridge-domain is comparable to that of an L2XC.
### Test 1.3: 10G L3 IPv4 Routing
This is the most straight forward test: the T-Rex loadtester in this case is sourcing traffic from
100.64.0.2 on its first interface, and 100.64.1.2 on its second interface. It will send ARP for the
nexthop (100.64.0.1 and 100.64.1.1, the Cisco), but the Cisco will not maintain an ARP table for the
loadtester, so I have to add static ARP entries for it. Otherwise, this is a simple test, which stress
tests the IPv4 forwarding path:
```
interface TenGigE0/1/0/0
ipv4 address 100.64.0.1 255.255.255.252
!
interface TenGigE0/2/0/0
ipv4 address 100.64.1.1 255.255.255.252
!
router static
address-family ipv4 unicast
16.0.0.0/8 100.64.1.2
48.0.0.0/8 100.64.0.2
!
!
arp vrf default 100.64.0.2 043f.72c3.d048 ARPA
arp vrf default 100.64.1.2 043f.72c3.d049 ARPA
!
```
Alright, so the cracks definitely show on this loadtest. The performance of small routed packets is quite
poor, weighing in at 35% of line rate in the unidirectional test, and 43% in the bidirectional test. It seems
that the ASR9k (at least in this hardware profile of `l3xl`) is not happy forwarding traffic at line rate,
and the routing performance is indeed significantly lower than the L2VPN performance. That's good to know!
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 815 kpps | 10.0 Gbps | 1.63 Mpps | 19.98 Gbps
imix | 3.27 Mpps | 9.99 Gbps | 6.52 Mpps | 19.96 Gbps
64b Multi | 5.14 Mpps | 3.45 Gbps | 12.3 Mpps | 8.28 Gbps
64b Single | 5.25 Mpps | 3.53 Gbps | 12.6 Mpps | 8.51 Gbps
## Test 2 - LACP 2x 20G
Link aggregation ([ref](https://en.wikipedia.org/wiki/Link_aggregation)) means combining or aggregating multiple
network connections in parallel by any of several methods, in order to increase throughput beyond what a single
connection could sustain, to provide redundancy in case one of the links should fail, or both. A link aggregation
group (LAG) is the combined collection of physical ports. Other umbrella terms used to describe the concept include
_trunking_, _bundling_, _bonding_, _channeling_ or _teaming_. Bundling ports together on a Cisco IOS/XR platform
like the ASR9k can be done by creating a _Bundle-Ether_ or _BE_. For reference, the same concept on VPP is called
a _BondEthernet_ and in Linux it'll often be referred to as simply a _bond_. They all refer to the same concept.
One thing that immediately comes to mind when thinking about LAGs is: how will the member port be selected for
outgoing traffic? A sensible approach is to hash on the L2 source and/or destination address (ie. the ethernet
host on either side of the LAG), but in the case of a router, as in our loadtest here, there is only
one MAC address on either side of the LAG. So a different hashing algorithm has to be chosen, preferably over the
source and/or destination _L3_ (IPv4 or IPv6) addresses. Luckily, both the FS switch as well as the Cisco ASR9006
support this.
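To make the consequence of L3 hashing concrete, here's an illustrative sketch (not the actual hash used by either the FS switch or IOS/XR): the member is a pure function of the src/dst pair, so a single flow can never use more than one LAG member.

```python
import ipaddress

def lag_member(src: str, dst: str, n_members: int) -> int:
    """Pick an egress LAG member from the L3 src/dst pair (illustrative only)."""
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % n_members

# A single flow always lands on the same member...
assert lag_member("100.64.0.2", "16.0.0.1", 2) == lag_member("100.64.0.2", "16.0.0.1", 2)
# ...while many flows spread across both members:
members = {lag_member("100.64.0.2", f"16.0.0.{i}", 2) for i in range(16)}
print(members)  # {0, 1}
```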
First I'll reconfigure the switch, and then reconfigure the router to use the newly created 2x 20G LAG ports.
```
interface TenGigabitEthernet 0/7
description Cust: Fridge Te0/2/0/0
port-group 2 mode active
!
interface TenGigabitEthernet 0/8
description Cust: Fridge Te0/2/0/1
port-group 2 mode active
!
interface TenGigabitEthernet 0/9
description Cust: Fridge Te0/1/0/0
port-group 1 mode active
!
interface TenGigabitEthernet 0/10
description Cust: Fridge Te0/1/0/1
port-group 1 mode active
!
interface AggregatePort 1
mtu 9216
aggregateport load-balance dst-ip
switchport access vlan 10
!
interface AggregatePort 2
mtu 9216
aggregateport load-balance dst-ip
switchport access vlan 20
!
```
And after the Cisco is converted to use _Bundle-Ether_ as well, the link status looks like this:
```
fsw0#show int ag1
...
Aggregate Port Informations:
Aggregate Number: 1
Name: "AggregatePort 1"
Members: (count=2)
Lower Limit: 1
TenGigabitEthernet 0/9 Link Status: Up Lacp Status: bndl
TenGigabitEthernet 0/10 Link Status: Up Lacp Status: bndl
Load Balance by: Destination IP
fsw0#show int usage up
Interface Bandwidth Average Usage Output Usage Input Usage
-------------------------------- ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1 10000 Mbit 0.0000018300% 0.0000013100% 0.0000023500%
TenGigabitEthernet 0/2 10000 Mbit 0.0000003450% 0.0000004700% 0.0000002200%
TenGigabitEthernet 0/7 10000 Mbit 0.0000012350% 0.0000022900% 0.0000001800%
TenGigabitEthernet 0/8 10000 Mbit 0.0000011450% 0.0000021800% 0.0000001100%
TenGigabitEthernet 0/9 10000 Mbit 0.0000011350% 0.0000022300% 0.0000000400%
TenGigabitEthernet 0/10 10000 Mbit 0.0000016700% 0.0000022500% 0.0000010900%
HundredGigabitEthernet 0/53 100000 Mbit 0.00000011900% 0.00000023800% 0.00000000000%
HundredGigabitEthernet 0/54 100000 Mbit 0.00000012500% 0.00000025000% 0.00000000000%
AggregatePort 1 20000 Mbit 0.0000014600% 0.0000023400% 0.0000005799%
AggregatePort 2 20000 Mbit 0.0000019575% 0.0000023950% 0.0000015200%
```
It's clear that both `AggregatePort` interfaces have 20Gbps of capacity and are using an L3
loadbalancing policy. Cool beans!
If you recall the loadtest theory from, for example, my [Netgate 6100 review]({%post_url 2021-11-26-netgate-6100%}),
it can sometimes be useful to run a single-flow loadtest, in which the source and destination
IP:Port stay the same. As I'll demonstrate, this is not only relevant for PC based routers like ones built
on VPP; it can be very relevant for vendor silicon and high-end routers too!
### Test 2.1 - 2x 20G LAG L2 Cross Connect
I scratched my head for a little while (and by a little while I mean more like an hour or so!), because usually
I come across _Bundle-Ether_ interfaces which have hashing turned on in the interface stanza, but in my
first loadtest run I did not see any traffic on the second member port. It turns out I needed the L2VPN-scoped
setting `l2vpn load-balancing flow src-dst-ip`, rather than the interface-scoped one:
```
interface Bundle-Ether1
description LAG1
l2transport
!
!
interface TenGigE0/1/0/0
bundle id 1 mode active
!
interface TenGigE0/1/0/1
bundle id 1 mode active
!
interface Bundle-Ether2
description LAG2
l2transport
!
!
interface TenGigE0/2/0/0
bundle id 2 mode active
!
interface TenGigE0/2/0/1
bundle id 2 mode active
!
l2vpn
load-balancing flow src-dst-ip
xconnect group loadtest
p2p xc01
interface Bundle-Ether1
interface Bundle-Ether2
!
!
!
```
Overall, the router performs as well as can be expected. In the single-flow 64 byte test, however, due to
the hashing over the available members in the LAG being on L3 information, the router is forced to always
choose the same member and effectively perform at 10G throughput, so it'll get a pass from me on the 64b
single test. In the multi-flow test, I can see that it does indeed forward over both LAG members, however
it reaches only 34.9Mpps which is 59% of line rate.
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 1.61 Mpps | 19.8 Gbps | 3.23 Mpps | 39.64 Gbps
imix | 6.40 Mpps | 19.8 Gbps | 12.8 Mpps | 39.53 Gbps
64b Multi | 29.44 Mpps | 19.8 Gbps | 34.9 Mpps | 23.48 Gbps
64b Single | 14.86 Mpps | 9.99 Gbps | 29.8 Mpps | 20.0 Gbps
### Test 2.2 - 2x 20G LAG Bridge Domain
Just like with Test 1.2 above, I can now transform this service from a Cross Connect into a fully formed
L2 bridge, by simply putting the two _Bundle-Ether_ interfaces in a _bridge-domain_ together, again
being careful to apply the L3 load-balancing policy on the `l2vpn` scope rather than the `interface`
scope:
```
l2vpn
load-balancing flow src-dst-ip
no xconnect group loadtest
bridge group loadtest
bridge-domain bd01
interface Bundle-Ether1
!
interface Bundle-Ether2
!
!
!
!
```
The results for this test show that indeed L2XC is computationally cheaper than _bridge-domain_ work. With
imix and 1514b packets, the router is fine and forwards 20G and 40G respectively. When the bridge is slammed
with 64 byte packets, its performance reaches only 65% with multiple flows in the unidirectional, and 47%
in the bidirectional loadtest. I found the performance difference with the L2 crossconnect above remarkable.
The single-flow loadtest cannot meaningfully stress both members of the LAG due to the src/dst being identical:
the best I can expect here is 10G performance, regardless of how many LAG members there are.
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 1.61 Mpps | 19.8 Gbps | 3.22 Mpps | 39.56 Gbps
imix | 6.39 Mpps | 19.8 Gbps | 12.8 Mpps | 39.58 Gbps
64b Multi | 20.12 Mpps | 13.5 Gbps | 28.2 Mpps | 18.93 Gbps
64b Single | 9.49 Mpps | 6.38 Gbps | 19.0 Mpps | 12.78 Gbps
### Test 2.3 - 2x 20G LAG L3 IPv4 Routing
And finally I turn my attention to the usual suspect: IPv4 routing. Here, I simply remove the `l2vpn`
stanza altogether, and remember to put the load-balancing policy on the _Bundle-Ether_ interfaces.
This ensures that upon transmission, both members of the LAG are used -- if and only if the
IP src/dst addresses differ, which is the case in most, but not all, of my loadtests :-)
```
no l2vpn
interface Bundle-Ether1
description LAG1
ipv4 address 100.64.1.1 255.255.255.252
bundle load-balancing hash src-ip
!
interface TenGigE0/1/0/0
bundle id 1 mode active
!
interface TenGigE0/1/0/1
bundle id 1 mode active
!
interface Bundle-Ether2
description LAG2
ipv4 address 100.64.0.1 255.255.255.252
bundle load-balancing hash src-ip
!
interface TenGigE0/2/0/0
bundle id 2 mode active
!
interface TenGigE0/2/0/1
bundle id 2 mode active
!
```
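To see why single-flow traffic sticks to one member, the per-flow member selection can be sketched as follows -- an illustrative toy hash, not Cisco's actual algorithm:

```python
# Toy model of L3 flow hashing over LAG members: the egress member is a
# pure function of the src/dst IP pair, so a single flow can never use
# more than one member port.
import zlib

def lag_member(src_ip: str, dst_ip: str, n_members: int) -> int:
    """Pick an egress LAG member for a flow, keyed on src/dst IP only."""
    return zlib.crc32(f"{src_ip}>{dst_ip}".encode()) % n_members

# A single flow always lands on the same member:
assert len({lag_member("16.0.0.1", "48.0.0.1", 2) for _ in range(1000)}) == 1

# Many flows (as in the multi-flow tests) spread across both members:
used = {lag_member(f"16.0.0.{i}", "48.0.0.1", 2) for i in range(256)}
assert used == {0, 1}
```

Any deterministic hash keyed only on L3 addresses behaves this way, which is exactly what the single-flow results below show.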
The LAG is fine at forwarding IPv4 traffic in 1514b and imix - full line rate and 40Gbps of traffic is
passed in the bidirectional test. With the 64b frames though, the forwarding performance is not line rate
but rather 84% of line in one direction, and 76% of line rate in the bidirectional test.
And once again, the single-flow loadtest cannot make use of more than one member port in the LAG, so it
is constrained to 10G throughput -- that said, it performs at only 42.6% of line rate.
Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b | 1.63 Mpps | 20.0 Gbps | 3.25 Mpps | 39.92 Gbps
imix | 6.51 Mpps | 19.9 Gbps | 13.04 Mpps | 39.91 Gbps
64b Multi | 12.52 Mpps | 8.41 Gbps | 22.49 Mpps | 15.11 Gbps
64b Single | 6.49 Mpps | 4.36 Gbps | 11.62 Mpps | 7.81 Gbps
## Bonus - ASR9k linear scaling
{{< image width="300px" float="right" src="/assets/asr9006/loaded.png" alt="ASR9k Loaded" >}}
As I've shown above, the loadtests often topped out at well under line rate for tests with small packet sizes, but I
can also see that the LAG tests offered a higher performance, although not quite double that of single ports. I can't
help but wonder: is this perhaps ***a per-port limit*** rather than a router-wide limit?
To answer this question, I decide to pull out all the stops and populate the ASR9k with as many XFPs as I have in my
stash, which is 9 pieces. One (Te0/0/0/0) still goes to uplink, because the machine should be carrying IGP and full
BGP tables at all times, which leaves me with 8x 10G XFPs. I decide it might be nice to combine all three
scenarios in one test:
1. Test 1.1 with Te0/1/0/2 cross connected to Te0/2/0/2, with a loadtest at 20Gbps.
1. Test 1.2 with Te0/1/0/3 in a bridge-domain with Te0/2/0/3, also with a loadtest at 20Gbps.
1. Test 2.3 with Te0/1/0/0+Te0/2/0/0 on one end, and Te0/1/0/1+Te0/2/0/1 on the other end, with an IPv4
loadtest at 40Gbps.
### 64 byte packets
It would be unfair to use single-flow on the LAG, considering the hashing is on L3 source and/or destination IPv4
addresses, so really only one member port would be used. To avoid this pitfall, I run with `vm=var2`. On the other
two tests, however, I do run the most stringent traffic pattern: single-flow loadtests. So off I go, firing
up ***three T-Rex*** instances.
First, the 10G L2 Cross Connect test (approximately 17.7Mpps):
```
Tx bps L2 | 7.64 Gbps | 7.64 Gbps | 15.27 Gbps
Tx bps L1 | 10.02 Gbps | 10.02 Gbps | 20.05 Gbps
Tx pps | 14.92 Mpps | 14.92 Mpps | 29.83 Mpps
Line Util. | 100.24 % | 100.24 % |
--- | | |
Rx bps | 4.52 Gbps | 4.52 Gbps | 9.05 Gbps
Rx pps | 8.84 Mpps | 8.84 Mpps | 17.67 Mpps
```
Then, the 10G Bridge Domain test (approximately 17.0Mpps):
```
Tx bps L2 | 7.61 Gbps | 7.61 Gbps | 15.22 Gbps
Tx bps L1 | 9.99 Gbps | 9.99 Gbps | 19.97 Gbps
Tx pps | 14.86 Mpps | 14.86 Mpps | 29.72 Mpps
Line Util. | 99.87 % | 99.87 % |
--- | | |
Rx bps | 4.36 Gbps | 4.36 Gbps | 8.72 Gbps
Rx pps | 8.51 Mpps | 8.51 Mpps | 17.02 Mpps
```
Finally, the 20G LAG IPv4 forwarding test (approximately 24.4Mpps), noting that the _Line Util._ here is of the 100G
loadtester ports, so 20% is expected:
```
Tx bps L2 | 15.22 Gbps | 15.23 Gbps | 30.45 Gbps
Tx bps L1 | 19.97 Gbps | 19.99 Gbps | 39.96 Gbps
Tx pps | 29.72 Mpps | 29.74 Mpps | 59.46 Mpps
Line Util. | 19.97 % | 19.99 % |
--- | | |
Rx bps | 5.68 Gbps | 6.82 Gbps | 12.51 Gbps
Rx pps | 11.1 Mpps | 13.33 Mpps | 24.43 Mpps
```
To summarize, in the above tests I am pumping 80Gbit (which is 8x 10Gbit full linerate at 64 byte packets, in
other words 119Mpps) into the machine, and it's returning 30.28Gbps (or 59.2Mpps which is 38%) of that traffic back
to the loadtesters. Features: yes; linerate: nope!
### 256 byte packets
Seeing the lowest performance of the router coming in at 8.5Mpps (or 57% of line rate), it stands to reason
that sending 256 byte packets will stay under the observed per-port packets/sec limits, so I decide to restart
the loadtesters with 256b packets. The expected ethernet frame is now 256 bytes plus 20 bytes of overhead, or 2208 bits,
of which ~4.53Mpps fit into a 10G link. Immediately, all ports go up to full capacity. As seen from
the Cisco's commandline:
```
RP/0/RSP0/CPU0:fridge#show interfaces | utility egrep 'output.*packets/sec' | exclude 0 packets
Mon Feb 21 22:14:02.250 UTC
5 minute output rate 18390237000 bits/sec, 9075919 packets/sec
5 minute output rate 18391127000 bits/sec, 9056714 packets/sec
5 minute output rate 9278278000 bits/sec, 4547012 packets/sec
5 minute output rate 9242023000 bits/sec, 4528937 packets/sec
5 minute output rate 9287749000 bits/sec, 4563507 packets/sec
5 minute output rate 9273688000 bits/sec, 4537368 packets/sec
5 minute output rate 9237466000 bits/sec, 4519367 packets/sec
5 minute output rate 9289136000 bits/sec, 4562365 packets/sec
5 minute output rate 9290096000 bits/sec, 4554872 packets/sec
```
The first two ports there are _Bundle-Ether_ interfaces _BE1_ and _BE2_, and the others are the TenGigE
ports. You can see that each one is forwarding the expected 4.53Mpps, and this lines up perfectly with T-Rex
which is sending 10Gbps of L1, and 9.28Gbps of L2 (the difference here is the ethernet overhead of 20 bytes
per frame, or 4.53 * 160 bits = 724Mbps), and it's receiving all of that traffic back on the other side, which
is good.
This clearly demonstrates the hypothesis that the machine is ***per-port pps-bound***.
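The frame-size arithmetic used throughout these tests is easy to reproduce: every Ethernet frame carries 20 bytes of L1 overhead (7B preamble, 1B start-of-frame delimiter, 12B inter-frame gap) on top of its L2 size.

```python
def line_rate_pps(l2_frame_bytes: int, link_bps: float = 10e9) -> float:
    """Maximum packets/sec for a given L2 frame size on a link."""
    L1_OVERHEAD = 20  # preamble + SFD + inter-frame gap, in bytes
    return link_bps / ((l2_frame_bytes + L1_OVERHEAD) * 8)

for size in (64, 256, 1514):
    print(f"{size:>5}b: {line_rate_pps(size) / 1e6:.2f} Mpps")
# 64b: 14.88 Mpps -- 256b: 4.53 Mpps -- 1514b: 0.81 Mpps
```

These are exactly the 14.88Mpps, 4.53Mpps and 810Kpps figures quoted elsewhere in this article.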
So the conclusion is that the A9K-RSP440-SE will typically forward only around 8Mpps on a single TenGigE port, and
13Mpps on a two-member LAG. However, it will do this _for every port_, and with at least 8x 10G ports saturated,
it remained fully responsive: OSPF and iBGP adjacencies stayed up, and ping times on the regular (Te0/0/0/0)
uplink port were smooth.
## Results
### 1514b and imix: OK!
{{< image width="1200px" src="/assets/asr9006/results-imix.png" alt="ASR9k Results - imix" >}}
Let me start by showing a side-by-side comparison of the imix tests in all scenarios in the graph above. The
graph for the 1514b tests looks very similar, differing only in the left Y axis: imix is a 3.2Mpps stream, while
1514b saturates the 10G port already at 810Kpps. But obviously, the router can do this just fine; even when used
on 8 ports, it doesn't mind at all. As I later learned, any traffic mix larger than 256b packets, or 4.5Mpps
per port, forwards fine in any configuration.
### 64b: Not so much :)
{{< image width="1200px" src="/assets/asr9006/results-64b.png" alt="ASR9k Results - 64b" >}}
{{< image width="1200px" src="/assets/asr9006/results-lacp-64b.png" alt="ASR9k Results - LACP 64b" >}}
These graphs show the throughput of the ASR9006 with a pair of A9K-RSP440-SE route switch processors. They
are rated at 440Gbps per slot, but their packets/sec rates are significantly lower than line rate. The top
graph shows the tests with 10G ports, and the bottom graph shows the same tests but with a 2x10G ports in
_Bundle-Ether_ LAG.
In an ideal situation, each test would follow the loadtester up to completion, and there would be no horizontal
lines breaking out partway through. As I showed, some of the loadtests really performed poorly in terms of
packets/sec forwarded. Understandably, the 20G LAG with single-flow can only utilize one member port, but it
then managed to push through only 6Mpps or so. Other tests did better, but overall I must say the
results were lower than I had expected.
### That juxtaposition
At the very top of this article I alluded to what I think is a cool juxtaposition. On the one hand, we have these
beasty ASR9k routers, running idle at 2.2kW for 24x10G and 40x1G ports (as is the case for the IP-Max router that
I took out for a spin here). They are large (10U of rackspace), heavy (40kg loaded), expensive (who cares about
list price, the street price is easily $10'000,- apiece).
On the other hand, we have these PC based machines with Vector Packet Processing, operating as low as 19W for 2x10G,
2x1G and 4x2.5G ports (like the [Netgate 6100]({{< ref "2021-11-26-netgate-6100.md" >}})) and offering roughly equal
performance per port, except having to drop only $700,- apiece. The VPP machines come with ~infinite RAM, even a
16GB machine will run much larger routing tables, including full BGP and so on - there is no (need for) TCAM, and yet
routing performance scales out with CPUs and larger CPU instruction/data-cache. Looking at my Ryzen 5950X based Hippo/Rhino
VPP machines, they *can* sustain line rate 64b packets on their 10G ports, due to each CPU being able to process
around 22.3Mpps, and the machine has 15 usable CPU cores. Intel or Mellanox 100G network cards are affordable, the
whole machine with 2x100G, 4x10G and 4x1G will set me back about $3'000,- in 1U and run 265 Watts when fully loaded.
See an extended rationale with backing data in my [FOSDEM'22 talk](/media/fosdem22/index.html).
## Conclusion
I set out to answer three questions in this article, and I'm ready to opine now:
1. Unidirectional vs Bidirectional: there is an impact - bidirectional tests (stressing both ingress and egress
of each individual router port) have lower performance, notably in packets smaller than 256b.
1. LACP performance penalty: there is an impact - 64b multiflow loadtest on LAG obtained 59%, 47% and 42% (for
Test 2.1-3) while for single ports, they obtained 68%, 64% and 43% (for Test 1.1-3). So while aggregate
throughput grows with the LACP _Bundle-Ether_ ports, individual port throughput is reduced.
1. The router performs at line rate for 1514b, imix, and really anything of 256b packets and larger. However,
   it does _not_ sustain line rate at 64b packets. Some tests passed with a unidirectional loadtest, but all
   tests failed with bidirectional loadtests.
After all of these tests, I have to say I am ***still a huge fan*** of the ASR9k. I had kind of expected that it
would perform at line rate for any/all of my tests, but the theme became clear after a few - the ports will only
forward between 8Mpps and 11Mpps (out of the needed 14.88Mpps), but _every_ port will do that, which means
the machine will still scale up significantly in practice. But for business internet, colocation, and non-residential
purposes, I would argue that routing _stability_ is most important, and with regards to performance, I would argue
that _aggregate bandwidth_ is more important than pure _packets/sec_ performance. Finally, the ASR in Cisco ASR9k stands
for _Advanced Services Router_, and being able to mix-and-match MPLS, L2VPN, Bridges, encapsulation, tunneling, and
have an expectation of 8-10Mpps per 10G port is absolutely reasonable. The ASR9k is a very competent machine.
### Loadtest data
I've dropped all loadtest data [here](/assets/asr9006/asr9006-loadtest.tar.gz) and if you'd like to play around with
the data, take a look at the HTML files in [this directory](/assets/asr9006/), they were built with Michal's
[trex-loadtest-viz](https://github.com/wejn/trex-loadtest-viz/) scripts.
## Acknowledgements
I wanted to give a shout-out to Fred and the crew at IP-Max for allowing me to play with their router during
these loadtests. I'll be configuring it to replace their router at NTT in March, so if you have a connection
to SwissIX via IP-Max, you will be notified ahead of time as we plan the maintenance window.
We call these things Fridges in the IP-Max world, because they emit so much cool air when they start :) The
ASR9001 is the microfridge, this ASR9006 is the minifridge, and the ASR9010 is the regular fridge.
---
date: "2022-02-24T13:46:12Z"
title: IPng Networks - Colocation
---
## Introduction
As with most companies, it started with an opportunity. I got my hands on a location with a 60m2 raised
floor, a significant 3x200A power connection, and a 10Gbps metro fiber connection. I asked my buddy Luuk
'what would it take to turn this into a colo?' and the rest is history.
Thanks to Daedalean AG who benefit from this infrastructure as well, making this first small colocation
site was not only interesting, but also very rewarding.
The colocation business is murder in Zurich - there are several very large datacenters (Equinix, NTT,
Coloz&uuml;ri, Interxion) all directly in or around the city, and I'm known to dwell in most of these. The
networking and service provider industry is quite small and well organized into _Network Operator Groups_,
so I work under the assumption that everybody knows everybody. I definitely like to pitch in and share
what I have built, both the physical bits but also the narrative.
This article describes the small serverroom I built at a partner's premises in Zurich Albisrieden. The
colo is _open for business_, that is to say: Please feel free to [reach out](/s/contact) if you're interested.
## Physical
{{< image width="180px" float="right" src="/assets/colo/power.png" alt="Power" >}}
It starts with a competent power distribution. Pictured to the right is a 200Amp 3-phase distribution panel
at Daedalean AG in Zurich. There's another similar panel on the other side of the floor, and both are
directly connected to EWZ and have plenty of smaller and larger breakers available (the room it's in used
to be a serverroom of the previous tenant, the City of Zurich).
{{< image width="180px" float="left" src="/assets/colo/eastron-sdm630.png" alt="Eastron SDM630" >}}
I start by installing a set of Eastron SDM630 power meters, so that I know what is being used
by IPng Networks and can pay my dues, and so that I can remotely read the state and power consumption using
MODBUS. This yields two 3-phase supplies with 32A breakers on each.
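The SDM630 exposes each measurement as an IEEE754 float32 spread over two consecutive big-endian 16-bit input registers. A small sketch of the decoding step (the register layout is per Eastron's Modbus protocol document; verify against your meter's revision):

```python
import struct

def decode_sdm630_float(hi: int, lo: int) -> float:
    """Reassemble two 16-bit Modbus input registers into a float32."""
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]

# e.g. registers (0x4316, 0x8000) decode to 150.5 -- a plausible voltage
assert decode_sdm630_float(0x4316, 0x8000) == 150.5
```

Any Modbus client library can fetch the raw registers; the decode above is the part that trips people up.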
{{< image width="180px" float="left" src="/assets/colo/pdus.png" alt="PDUs" >}}
Then, I go scouring the Internet for a few second hand 19" racks. I find two 800x1000mm racks, although they
are all the way across Switzerland. They're very affordable, and what's better, they each come
with two remotely switchable zero-U APC power distribution strips. Score!
<hr />
{{< image width="180px" float="right" src="/assets/colo/racks-installed1.png" alt="Racks Installed" >}}
Laura and I rented a little (with which I mean: huge) minivan and went to pick up the racks. The folks at
Daedalean kindly helped us schlepp them up the stairs to the serverroom, and we installed the racks in the
serverroom, connecting them redundantly to power using the four PDUs. I have to be honest: there is no battery
or diesel backup in this room, as it's in the middle of the city and it'd be weird to have generators on site
for such a small room. It's a compromise we have to make.
{{< image width="180px" float="left" src="/assets/colo/racks-installed2.png" alt="Racks Installed w/ doors" >}}
Of course, I have to supply some form of eye-candy, so I decide to make a few decals for the racks, so that they
sport the _IPng @ DDLN_ designation. There are a few other racks and infrastructure in the same room, of course,
and it's cool to be able to identify IPng's kit upon entering the room. They even have doors, look!
The floor space here is about 60m2 of usable serverroom, so there is plenty of room to grow, and if the network
ever grows larger than 2x10G uplinks, it is definitely possible to rent dark fiber from this location thanks to
the liberal Swiss telco situation. But for now, we start small with 1x 10G layer2 backhaul to Interxion in
Glattbrugg. In 2022, I expect to expand with a second 10G layer2 backhaul to NTT in R&uuml;mlang to make the site
fully redundant.
<!-- ![Racks Pre-Installed](/assets/colo/racks-preinstall.png){: style="width:180px; float: right; margin-left: 2em; margin-bottom: 1em;"} -->
<!-- ![Storage](/assets/colo/storage.png){: style="width:180px; float: right; margin-left: 2em; margin-bottom: 1em;"} -->
## Logical
The physical situation is sorted, we have cooling, power, 19" racks with PDUs, and uplink connectivity. It's time
to think about a simple yet redundant colocation setup:
{{< image width="800px" src="/assets/colo/DDLN Logical Sketch.png" alt="Design" >}}
In this design, I'm keeping it relatively straight forward. The 10G ethernet leased line from Solnet plugs into one
switch, and the 10G leased line from Init7 plugs into the other. Everything is then built in pairs.
I bring:
* Two switches (Mikrotik CRS354, with 48x1G, 4x10G and 2x40G) and two power supplies, connected together with 40G.
* Two Dell R630 routers running VPP (of course), two power supplies, with 3x10G each:
* One leg goes back-to-back for OSPF/OSPFv3 between the two routers
* One leg goes to each switch; the "local" leg will be in a VLAN into the uplink VLL, and expose the router on the
colocation VLAN and any L2 backhaul services. The "remote" leg will be in a VLAN to the other uplink VLL.
* Two Supermicro hypervisors, each connected with 10G to their own switch
* Two PCEngines APU4 machines, each connected to Daedalean's corporate network for OOB
* These have serial connection to the PDUs and Mikrotik switches
* They also have mgmt network connection to the Dell VPP routers and Mikrotik switches
* They also run a Wireguard access service which exposes an IPMI VLAN for colo clusters
The result is that each of these can fail without disturbing traffic to/from the servers in the colocation. Each
server in the colo gets two power connections (one on each feed), two 1Gbps ports (one for IPMI and one for Internet).
The logical colocation network has VRRP configured for direct/live failover of IPv4 and IPv6 gateways, but the VPP
routers can offer full redundant IPv4 and IPv6 transit, as well as L2 backhaul to any other location where IPng
Networks has a presence (which is [quite a few](https://as8298.peeringdb.com/)).
## Conclusion
The colocation that I built, together with Daedalean, is very special. It's not carrier grade, it doesn't have
a building/room wide UPS or diesel generators, but it does have competent power, cooling, physical and logical
deployment. But most of all: it redundantly connects to AS8298 and offers full N+1 redundancy on the logical
level.
If you're interested in hosting a server in this colocation, [contact us](/s/contact/)!
---
date: "2022-03-03T19:05:14Z"
title: Syslog to Telegram
---
## Introduction
From time to time, I wish I could be made aware of failures earlier. There are two events, in particular,
that I am interested to know about very quickly, as they may impact service at AS8298:
1. _Open Shortest Path First_ (OSPF) adjacency removals. OSPF is a link-state protocol and it knows when
a physical link goes down, that the peer (neighbor) is no longer reachable. It can then recompute
paths to other routers fairly quickly. But if the link stays up but connectivity is interrupted,
for example because there is a switch in the path, it can take a relatively long time to detect.
1. _Bidirectional Forwarding Detection_ (BFD) session timeouts. BFD sets up a rapid (for example every
50ms, or 20Hz) exchange of unidirectional UDP packets between two hosts. If a number of packets (for example
40 packets, or 2 seconds worth) are not received, the link can be assumed to be dead.
Notably, [BIRD](https://bird.network.cz/), as many other vendors do, can combine the two. At IPng, each
OSPF adjacency is protected by BFD. What happens is that once an OSPF enabled link comes up, OSPF _Hello_
packets will be periodically transmitted (with a period called the _Hello Timer_, typically once every
10 seconds). When a number of these are missed (called a _Dead Timer_, typically 40 seconds), the neighbor
is considered missing in action and the session cleaned up.
To help recover from link failure faster than 40 seconds, a new BFD session can be set up from any
neighbor that sends a _Hello_ packet. From then on, BFD will send a steady stream of UDP packets, and expect
as well the neighbor to send them. If BFD detects a timeout, it can inform BIRD to take action well
before the OSPF _Dead Timer_.
Very strict timers are known to be used, for example 10ms and 5 missed packets, or 50ms (!!) of timeout.
But at IPng, in the typical example above, I instruct BFD to send packets every 50ms, and time out
after 40 missed packets, or two (2) seconds of link downtime. Considering BIRD+VPP converge a full
routing table in about 7 seconds, that gives me an end-to-end recovery time of under 10 seconds, which
is respectable, all the while avoiding triggering on false positives.
I'd like to be made aware of these events, which could signal a darkfiber cut or WDM optic failure, or
an EoMPLS (ie _Virtual Leased Line_ or VLL failure), or a non-recoverable VPP dataplane crash. To a lesser
extent, being made explicitly aware of BGP adjacencies to downstream (IP Transit customers) or upstream
(IP Transit providers) can be useful.
### Syslog NG
There are two parts to this. First I want to have a (set of) central receiver servers, that will each
receive messages from the routers in the field. I decide to take three servers: the main one being
`nms.ipng.nl`, which runs LibreNMS, and further two read-only route collectors `rr0.ddln0.ipng.ch` at
our own DDLN [colocation]({{< ref "2022-02-24-colo.md" >}}) in Zurich, and `rr0.nlams0.ipng.ch` running
at Coloclue in DCG, Amsterdam.
Of course, it would be a mistake to use UDP as a transport for messages that discuss potential network
outages. Having receivers in multiple places in the network does help a little bit, but I decide to
configure the server (and later, the clients) to use TCP. This way, messages are queued to be sent,
and if the TCP connection has to be rerouted when the underlying network converges, I can be pretty certain
that the messages will arrive at the central logserver _eventually_.
#### Syslog Server
The configuration for each of the receiving servers is the same, very straight forward:
```
$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/listen.conf
template t_remote {
template("\$ISODATE \$FULLHOST_FROM [\$LEVEL] \${PROGRAM}: \${MESSAGE}\n");
template_escape(no);
};
source s_network_tcp {
network( transport("tcp") ip("::") ip-protocol(6) port(601) max-connections(300) );
};
destination d_ipng { file("/var/log/ipng.log" template(t_remote) template-escape(no)); };
log { source(s_network_tcp); destination(d_ipng); };
EOF
$ sudo systemctl restart syslog-ng
```
First, I define a _template_ which logs in a consistent and predictable manner. Then, I configure a
_source_ which listens on IPv4 and IPv6 on TCP port 601, which allows for more than the default 10
connections. I configure a _destination_ into a file, using the template. Then I tie the log source
into the destination, and restart `syslog-ng`.
One thing that took me a while to realize is that for `syslog-ng`, the parser applied to incoming
messages is different depending on the port used ([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/17)):
* 514, both TCP and UDP, for [RFC3164](https://datatracker.ietf.org/doc/html/rfc3164) (BSD-syslog) formatted traffic
* 601 TCP, for [RFC5424](https://datatracker.ietf.org/doc/html/rfc5424) (IETF-syslog) formatted traffic
* 6514 TCP, for TLS-encrypted traffic (of IETF-syslog messages)
After seeing malformed messages in the syslog, notably with duplicate host/program/timestamp, I ultimately
understood that this was because I was sending RFC5424 style messages to an RFC3164 enabled port (514).
Once I moved the transport to be port 601, the parser matched and loglines were correct.
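For reference, the same event looks quite different in the two formats (examples lightly adapted from the RFCs):

```
# RFC3164 (BSD-syslog), what the port 514 parser expects:
<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8

# RFC5424 (IETF-syslog), what the port 601 parser expects:
<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed for lonvick on /dev/pts/8
```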
And another detail -- I feel a little bit proud for not forgetting to add a logrotate entry for
this new log file, keeping 10 days worth of compressed logs:
```
$ cat << EOF | sudo tee /etc/logrotate.d/syslog-ng-ipng
/var/log/ipng.log
{
rotate 10
daily
missingok
notifempty
delaycompress
compress
postrotate
invoke-rc.d syslog-ng reload > /dev/null
endscript
}
EOF
```
I open up the firewall in these new syslog servers for TCP port 601, from any loopback addresses on
AS8298's network.
#### Syslog Clients
The clients install `syslog-ng-core` (which avoids all of the extra packages). On the routers, I
have to make sure that the syslog server runs in the `dataplane` namespace, otherwise it will not
have connectivity to send its messages. And, quite importantly, I should make sure that the
TCP connections are bound to the loopback address of the router, not any arbitrary interface,
as those could go down, rendering the TCP connection useless. So taking `nlams0.ipng.ch` as
an example, here's a configuration snippet:
```
$ sudo apt install syslog-ng-core
$ sudo sed -i -e 's,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,' \
/lib/systemd/system/syslog-ng.service
$ LO4=194.1.163.32
$ LO6=2001:678:d78::8
$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/remote.conf
destination d_nms_tcp { tcp("194.1.163.89" localip("$LO4") port(601)); };
destination d_rr0_nlams0_tcp { tcp("2a02:898:146::4" localip("$LO6") port(601)); };
destination d_rr0_ddln0_tcp { tcp("2001:678:d78:4::1:4" localip("$LO6") port(601)); };
filter f_bird { program(bird); };
log { source(s_src); filter(f_bird); destination(d_nms_tcp); };
log { source(s_src); filter(f_bird); destination(d_rr0_nlams0_tcp); };
log { source(s_src); filter(f_bird); destination(d_rr0_ddln0_tcp); };
EOF
$ sudo systemctl restart syslog-ng
```
Here, I create simply three _destination_ entries, one for each log-sink. Then I create a _filter_
that grabs logs sent, but only for the BIRD server. You can imagine that later, I can add other things
to this -- for example `keepalived` for VRRP failovers. Finally, I tie these together by applying
the filter to the source and sending the result to each syslog server.
So far, so good.
### Bird
For consistency, (although not strictly necessary for the logging and further handling), I add
ISO data timestamping and enable syslogging in `/etc/bird/bird.conf`:
```
timeformat base iso long;
timeformat log iso long;
timeformat protocol iso long;
timeformat route iso long;
log syslog all;
```
And for the two protocols of interest, I add `debug { events };` to the BFD and OSPF protocols. Note
the `bfd on` stanza in the OSPF interfaces -- this instructs BIRD to create a BFD session for each of
the neighbors found on such an interface, and if BFD were to fail, to tear down the adjacency
faster than the regular _Dead Timer_ timeouts.
```
protocol bfd bfd1 {
debug { events };
interface "*" { interval 50 ms; multiplier 40; };
}
protocol ospf v2 ospf4 {
debug { events };
ipv4 { export filter ospf_export; import all; };
area 0 {
interface "loop0" { stub yes; };
interface "xe1-3.100" { type pointopoint; cost 61; bfd on; };
interface "xe1-3.200" { type pointopoint; cost 75; bfd on; };
};
}
```
This will emit loglines for (amongst others) state changes on BFD neighbors and OSPF adjacencies.
There are a lot of messages to choose from, but I found that the following ones contain the minimum
information needed to convey links going down or up (both from BFD's point of view as well as from OSPF
and OSPFv3's point of view). I can demonstrate this by making the link between Hippo and Rhino go down
(ie. by shutting the switchport, or unplugging the cable).
And after this, I can see on `nms.ipng.nl` that the logs start streaming in:
```
pim@nms:~$ tail -f /var/log/ipng.log | egrep '(ospf[46]|bfd1):.*changed state.*to (Down|Up|Full)'
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Up to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Full to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Up to Down
2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Full to Down
2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Init to Up
2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Down to Up
```
Important events are now detected and sent, using reliable TCP transport, to multiple logging machines:
messages about BFD and OSPF adjacency changes make it to a central place.
### Telegram Bot
{{< image width="150px" float="left" src="/assets/syslog-telegram/ptb-logo.png" alt="PTB" >}}
Of course I can go tail the logfile on one of the servers, but I think it'd be a bit more elegant to have
a computer do the pattern matching for me. One way might be to use the `syslog-ng` destination feature _program()_
([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.22/administration-guide/43)),
which pipes these logs through a userspace process, receiving them on stdin and doing interesting things with
them, such as interacting with Telegram, the delivery mechanism of choice for IPng's monitoring systems.
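Such a pipeline could look like this -- a hypothetical sketch, reusing the `t_remote` template and `s_network_tcp` source from earlier; the binary name and path are made up:

```
destination d_telegram {
  program("/usr/local/bin/telegram-pipe" template(t_remote));
};
log { source(s_network_tcp); destination(d_telegram); };
```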
Building such a Telegram enabled bot is very straight forward, thanks to the excellent documentation of the
Telegram API, and the existence of `python-telegram-bot` ([ref](https://github.com/python-telegram-bot/python-telegram-bot)).
However, to keep my bot from being tied at the hip to `syslog-ng`, I decide to simply tail a number of
logfiles from the commandline (ie `pttb /var/log/*.log`) - and here emerges the name of my little bot:
Python Telegram Tail Bot, or _pttb_ for short, that:
* Tails the syslog logstream from one or more files, ie `/var/log/ipng.log`
* Pattern matches on loglines, after which an `incident` is created
* Waits for a predefined number of seconds (which may be zero) to see if more loglines match, adding them to
the `incident`
* Holds the `incident` against a list of known regular expression `silences`, throwing away those which
aren't meant to be distributed
* Sends to a predefined group chat those incidents which aren't silenced
The bot should allow for the following features, based on a YAML configuration file, which will allow it to be
restarted and upgraded:
* A (mandatory) `TOKEN` to interact with Telegram API
* A (mandatory) single `chat-id` - messages will be sent to this Telegram group chat
* An (optional) list of logline triggers, consisting of:
* a regular expression to match in the logstream
* a grace period to coalesce additional loglines of the same trigger into the incident
* a description to send once the incident is sent to Telegram
* An (optional) list of silences, consisting of:
* a regular expression to match any incident message data in
* an expiry timestamp
* a description carrying the reason for the silence
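Putting these options together, a configuration file following the description above might look like this (a hypothetical sketch; the actual field names in `pttb` may differ):

```
token: "123456:ABC-hypothetical-bot-token"
chat-id: -100123456789
triggers:
- regexp: 'bird: .* changed state from Up to Down'
  grace: 10
  message: "A BFD or OSPF adjacency went down"
silences:
- regexp: 'hippo.btl.ipng.ch'
  expiry: "2022-03-01T00:00:00Z"
  reason: "Hippo is being upgraded, silenced until March 1st"
```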
The bot will start up, announce itself on the `chat-id` group, and then listen on Telegram for the following
commands:
* **/help** - a list of available commands
* **/trigger** - without parameters, list the current triggers
* **/trigger add &lt;regexp&gt; [duration] [&lt;message&gt;]** - with one parameter set a trigger on a regular expression. Optionally,
add a duration in seconds between [0..3600>, within which additional matched loglines will be added to the
same incident, and an optional message to include in the Telegram alert.
* **/trigger del &lt;idx&gt;** - with one parameter, remove the trigger with that index (use /trigger to see the list).
* **/silence** - without parameters, list the current silences.
* **/silence add &lt;regexp&gt; [duration] [&lt;reason&gt;]** - with one parameter, set a default silence for 1d; optionally
add a duration in the form of `[1-9][0-9]*([hdm])?`, where the unit defaults to hours (and can be days or minutes), and an optional
reason for the silence.
* **/silence del &lt;idx&gt;** - with one parameter, remove the silence with that index (use /silence to see the list).
* **/stfu [duration]** - a shorthand for a silence with regular expression `.*`, will suppress all notifications, with a
duration similar to the **/silence add** subcommand.
* **/stats** - shows some runtime statistics, notably how many loglines were processed, how many incidents created,
and how many were sent or suppressed due to a silence.
It will save its configuration file any time a silence or trigger is added or deleted. It will (obviously) then
start sending incidents to the `chat-id` group-chat when they occur.
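The core of such a bot (tail loglines, match triggers, coalesce matches within a grace period, and suppress silenced incidents) can be sketched in plain Python. This is an illustrative sketch under my own naming, not the actual `pttb` code:

```
import re
import time

class Trigger:
    def __init__(self, regexp, grace=0, message=""):
        self.re, self.grace, self.message = re.compile(regexp), grace, message

class Incident:
    def __init__(self, trigger, deadline):
        self.trigger, self.deadline, self.lines = trigger, deadline, []

class LogScanner:
    def __init__(self, triggers, silences):
        self.triggers = triggers   # list of Trigger
        self.silences = silences   # list of compiled regexps
        self.pending = {}          # Trigger -> open Incident

    def feed(self, line, now=None):
        """Offer one logline; return the list of incidents ready to send."""
        now = now if now is not None else time.time()
        for t in self.triggers:
            if t.re.search(line):
                # Coalesce into an open incident, or open a new one with a deadline.
                inc = self.pending.setdefault(t, Incident(t, now + t.grace))
                inc.lines.append(line)
        return self.flush(now)

    def flush(self, now):
        """Close incidents whose grace period expired; drop silenced ones."""
        ready = []
        for t, inc in list(self.pending.items()):
            if now >= inc.deadline:
                del self.pending[t]
                text = "\n".join(inc.lines)
                if not any(s.search(text) for s in self.silences):
                    ready.append(inc)
        return ready
```

An outer loop would tail the given files, call `feed()` for each new line, and deliver each returned incident to the group chat via `python-telegram-bot`.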
## Results
And a few fun hours of hacking later, I submitted a first rough approximation of a useful syslog-scanning Telegram bot
on [Github](https://github.com/pimvanpelt/python-telegram-tail-bot). It does seem to work, although not all functions
are implemented yet (I'll get them done in the month of March, probably):
{{< image src="/assets/syslog-telegram/demo-telegram.png" alt="PTTB" >}}
So now I'll be pretty quickly and elegantly kept up to date by this logscanner, in addition to my already existing
LibreNMS logging, monitoring and alerting. If you find this stuff useful, feel free to grab a copy from
[Github](https://github.com/pimvanpelt/python-telegram-tail-bot), the code is open source and licensed with a liberal
APACHE 2.0 license, and is based on excellent work of [Python Telegram Bot](https://github.com/python-telegram-bot/python-telegram-bot).
---
date: "2022-03-27T14:19:23Z"
title: VPP Configuration - Part1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community
merged the [Linux Control Plane]({% post_url 2021-08-12-vpp-1 %}) plugin. I wrote about its deployment
to both regular servers like the [Supermicro]({% post_url 2021-09-21-vpp-7 %}) routers that run on our
[AS8298]({% post_url 2021-02-27-network %}), as well as virtual machines running in
[KVM/Qemu]({% post_url 2021-12-23-vpp-playground %}).
Now that I've been running VPP in production for about half a year, I can't help but notice one specific
drawback: VPP is a programmable dataplane, and _by design_ it does not include any configuration or
controlplane management stack. It's meant to be integrated into a full stack by operators. For end-users,
this unfortunately means that typing on the CLI won't persist any configuration, and if VPP is restarted,
it will not pick up where it left off. There's one developer convenience in the form of the `exec`
command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line
by line. However, if any typo is made in the file, processing immediately stops. It's meant as a convenience
for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.
Luckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts,
I'll detail the work I've done to create a configuration utility that can take a YAML configuration file,
compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply
the configuration to the dataplane. Welcome to `vppcfg`!
In this first post, let's take a look at table stakes: writing a YAML specification which models the main
configuration elements of VPP, and then ensures that the YAML file is both syntactically and
semantically correct.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
## YAML Specification
I decide to use [Yamale](https://github.com/23andMe/Yamale/), which is a schema description language
and validator for [YAML](http://www.yaml.org/spec/1.2/spec.html). YAML is a very simple, text/human-readable
annotation format that can be used to store a wide range of data types. An interesting, but quick introduction
to the YAML language can be found on CraftIRC's [GitHub](https://github.com/Animosity/CraftIRC/wiki/Complete-idiot's-introduction-to-yaml)
page.
The first order of business for me is to devise a YAML file specification which models the configuration
options of VPP objects in an idiomatic way. It's appealing to immediately build a
higher-level abstraction, but I resist the urge and instead look at the types of objects that exist in
VPP, for example the `VNET_DEVICE_CLASS` types:
* ***ethernet_simulated_device_class***: Loopbacks
* ***bvi_device_class***: Bridge Virtual Interfaces
* ***dpdk_device_class***: DPDK Interfaces
* ***rdma_device_class***: RDMA Interfaces
* ***bond_device_class***: BondEthernet Interfaces
* ***vxlan_device_class***: VXLAN Tunnels
There are several others, but I decide to start with these, as I'll be needing each one of these in my
own network. Looking over the device class specification, I learn a lot about how they are configured,
which arguments (and of which types) they need, and which data structures they are represented as in VPP
internally.
### Syntax Validation
Yamale first reads a _schema_ definition file, and then holds a given YAML file against the definition
and shows if the file has a syntax that is well-formed or not. As a practical example, let me start
with the following definition:
```
$ cat << EOF > schema.yaml
sub-interfaces: map(include('sub-interface'),key=int())
---
sub-interface:
description: str(exclude='\'"',len=64,required=False)
lcp: str(max=15,matches='[a-z]+[a-z0-9-]*',required=False)
mtu: int(min=128,max=9216,required=False)
addresses: list(ip(version=6),required=False)
encapsulation: include('encapsulation',required=False)
---
encapsulation:
dot1q: int(min=1,max=4095,required=False)
dot1ad: int(min=1,max=4095,required=False)
inner-dot1q: int(min=1,max=4095,required=False)
exact-match: bool(required=False)
EOF
```
This snippet creates two types, one called `sub-interface` and the other called `encapsulation`. The fields
of the sub-interface, for example the `description` field, must follow the given typing to be valid. In the
case of the description, it must be at most 64 characters long and it must not contain the `'` or `"`
characters. The designation `required=False` notes that this is an optional field and may be omitted.
The `lcp` field is also a string, but it must match a certain regular expression and start with a lowercase
letter. The `mtu` field must be an integer between 128 and 9216, and so on.
One nice feature of Yamale is the ability to reference other object types. I do this here with the `encapsulation`
field, which references an object type of the same name, and again, is optional. This means that when the
`encapsulation` field is encountered in the YAML file Yamale is validating, it'll hold the contents of that
field to the schema below. There, we have `dot1q`, `dot1ad`, `inner-dot1q` and `exact-match` fields, which are
all optional.
Then, at the top of the file, I create the entrypoint schema, which expects YAML files to contain a map
called `sub-interfaces` which is keyed by integers and contains values of type `sub-interface`, tying it all
together.
Yamale comes with a commandline utility to do direct schema validation, which is handy. Let me demonstrate with
the following terrible YAML:
```
$ cat << EOF > bad.yaml
sub-interfaces:
100:
description: "Pim's illegal description"
lcp: "NotAGoodName-AmIRite"
mtu: 16384
addresses: 192.0.2.1
encapsulation: False
EOF
$ yamale -s schema.yaml bad.yaml
Validating /home/pim/bad.yaml...
Validation failed!
Error validating data '/home/pim/bad.yaml' with schema '/home/pim/schema.yaml'
sub-interfaces.100.description: 'Pim's illegal description' contains excluded character '''
sub-interfaces.100.lcp: Length of NotAGoodName-AmIRite is greater than 15
sub-interfaces.100.lcp: NotAGoodName-AmIRite is not a regex match.
sub-interfaces.100.mtu: 16384 is greater than 9216
sub-interfaces.100.addresses: '192.0.2.1' is not a list.
sub-interfaces.100.encapsulation : 'False' is not a map
```
This file trips so many syntax violations, it should be a crime! In fact every single field is invalid. The one that
is closest to being correct is the `addresses` field, but there I've set it up as a _list_ (not a scalar), and even
then, the list elements are expected to be IPv6 addresses, not IPv4 ones.
So let me try again:
```
$ cat << EOF > good.yaml
sub-interfaces:
100:
description: "Core: switch.example.com Te0/1"
lcp: "xe3-0-0"
mtu: 9216
addresses: [ 2001:db8::1, 2001:db8:1::1 ]
encapsulation:
dot1q: 100
exact-match: True
EOF
$ yamale good.yaml
Validating /home/pim/good.yaml...
Validation success! 👍
```
### Semantic Validation
When using Yamale, I can make a good start in _syntax_ validation, that is to say, if a field is present, it follows
a prescribed type. But that's not the whole story. There are many configuration files I can think of that
would be syntactically correct, but still make no sense in practice. For example, creating an encapsulation which
has both `dot1q` and `dot1ad`, or creating a _LIP_ (Linux Interface Pair) for a sub-interface which does not
have `exact-match` set. Or how about having two sub-interfaces with the exact same encapsulation?
Here's where _semantic_ validation comes into play. So I set out to create all sorts of constraints, and after
reading the (Yamale validated, so syntactically correct) YAML file, I can hand it into a set of validators that
check for violations of these constraints. By means of example, let me create a few constraints that might capture
the issues described above:
1. If a sub-interface has encapsulation:
1. It MUST have `dot1q` OR `dot1ad` set
1. It MUST NOT have `dot1q` AND `dot1ad` both set
1. If a sub-interface has one or more `addresses`:
1. Its encapsulation MUST be set to `exact-match`
1. It MUST have an `lcp` set.
1. Each individual `address` MUST NOT occur in any other interface
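Constraints like these translate almost one-to-one into small check functions. Here's a simplified sketch of the rules above in plain Python (illustrative only; the function and argument names are mine, not vppcfg's):

```
def validate_sub_interface(ifname, sub_id, sub, addresses_in_use):
    """Check one sub-interface dict against the constraints listed above.

    `addresses_in_use` maps address -> owning interface name elsewhere in the
    config. Returns (ok, messages), mirroring a (bool, [messages]) convention.
    """
    msgs = []
    encap = sub.get("encapsulation")
    if encap is not None:
        if "dot1q" not in encap and "dot1ad" not in encap:
            msgs.append(f"sub-interface {ifname}.{sub_id} must set dot1q or dot1ad")
        if "dot1q" in encap and "dot1ad" in encap:
            msgs.append(f"sub-interface {ifname}.{sub_id} cannot set both dot1q and dot1ad")
    addresses = sub.get("addresses", [])
    if addresses:
        if encap is not None and not encap.get("exact-match", False):
            msgs.append(f"sub-interface {ifname}.{sub_id} has addresses but encapsulation is not exact-match")
        if "lcp" not in sub:
            msgs.append(f"sub-interface {ifname}.{sub_id} has addresses but no lcp")
        for addr in addresses:
            owner = addresses_in_use.get(addr)
            if owner is not None and owner != f"{ifname}.{sub_id}":
                msgs.append(f"address {addr} is already in use on {owner}")
    return len(msgs) == 0, msgs
```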
## Config Validation
After spending a few weeks thinking about the problem, I came up with 59 semantic constraints, that is to say,
things that might appear OK but will yield impossible-to-implement or otherwise erratic VPP configurations.
This article would be a bad place to discuss them all, so I will talk about the structure of `vppcfg` instead.
First, a `Validator` class is instantiated with the Yamale schema. Then, a YAML file is read and passed to the
validator's `validate()` method. It will first run Yamale on the YAML file and make note of any issues that arise.
If any arise, it will enumerate them in a list and return (bool, [list-of-messages]). The validation will have failed
if the boolean returned is _false_, and if so, the list of messages will help understand which constraint was
violated.
The `vppcfg` schema consists of toplevel types, which are validated in order:
* ***validate_bondethernets()***'s job is to ensure that anything configured in the `bondethernets` toplevel map
is correct. For example, if a _BondEthernet_ device is created there, its members should reference existing
interfaces, and it itself should make an appearance in the `interfaces` map, and the MTU of each member should
be equal to the MTU of the _BondEthernet_, and so on. See `config/bondethernet.py` for a complete rundown.
* ***validate_loopbacks()*** is pretty straightforward. It makes a few common assertions, such as that if the
loopback has addresses, it must also have an LCP, and if it has an LCP, that no other interface has the same
LCP name, and that all of the addresses configured are unique.
* ***validate_vxlan_tunnels()***: Yamale already asserts that the `local` and `remote` fields are present and an
IP address. The semantic validator ensures that the address family of the tunnel endpoints are the same, and that
the used `VNI` is unique.
* ***validate_bridgedomains()*** fiddles with its _Bridge Virtual Interface_, making sure that its addresses and
LCP name are unique. Further, it makes sure that a given member interface is in at most one bridge, and that said
member is in L2 mode, in other words, that it doesn't have an LCP or an address. An L2 interface can be either in
a bridgedomain, or act as an L2 Cross Connect, but not both. Finally, it asserts that each member has an MTU
identical to the bridge's MTU value.
* ***validate_interfaces()*** is by far the most complex, but a few things worth calling out are that each
sub-interface must have a unique encapsulation, and that if a given QinQ or QinAD (doubly tagged) sub-interface has an LCP,
a parent Dot1Q or Dot1AD interface with the correct encapsulation must exist and also have an LCP.
See `config/interface.py` for an extensive overview.
## Testing
Of course, in a configuration model as complex as a VPP router's, being able to do a lot of validation helps ensure that
the constraints above are implemented correctly. To help this along, I use _regular_ unittesting as provided by
the Python3 [unittest](https://docs.python.org/3/library/unittest.html) framework, but I extend it to also run
a special kind of test which I call a `YAMLTest`.
### Unit Testing
This is bread and butter, and should be straightforward for software engineers. I follow a model of so-called
test-driven development, where I start off by writing a test, which of course fails because the code hasn't been
implemented yet. Then I implement the code, and run this and all other unittests expecting them to pass.
Let me give an example based on BondEthernets, with a YAML config file as follows:
```
bondethernets:
BondEthernet0:
interfaces: [ GigabitEthernet1/0/0, GigabitEthernet1/0/1 ]
interfaces:
GigabitEthernet1/0/0:
mtu: 3000
GigabitEthernet1/0/1:
mtu: 3000
GigabitEthernet2/0/0:
mtu: 3000
sub-interfaces:
100:
mtu: 2000
BondEthernet0:
mtu: 3000
lcp: "be012345678"
addresses: [ 192.0.2.1/29, 2001:db8::1/64 ]
sub-interfaces:
100:
mtu: 2000
addresses: [ 192.0.2.9/29, 2001:db8:1::1/64 ]
```
As I mentioned when discussing the semantic constraints, there are a few here that jump out at me. First, the
BondEthernet members `Gi1/0/0` and `Gi1/0/1` must exist. There is one BondEthernet defined in this file (obvious,
I know, but bear with me), and `Gi2/0/0` is not a bond member, and certainly `Gi2/0/0.100` is not a bond member,
because having a sub-interface as an LACP member would be super weird. Taking things like this into account, here's
a few tests that could assert that the behavior of the `bondethernets` map in the YAML config is correct:
```
class TestBondEthernetMethods(unittest.TestCase):
def setUp(self):
with open("unittest/test_bondethernet.yaml", "r") as f:
self.cfg = yaml.load(f, Loader = yaml.FullLoader)
def test_get_by_name(self):
ifname, iface = bondethernet.get_by_name(self.cfg, "BondEthernet0")
self.assertIsNotNone(iface)
self.assertEqual("BondEthernet0", ifname)
self.assertIn("GigabitEthernet1/0/0", iface['interfaces'])
self.assertNotIn("GigabitEthernet2/0/0", iface['interfaces'])
ifname, iface = bondethernet.get_by_name(self.cfg, "BondEthernet-notexist")
self.assertIsNone(iface)
self.assertIsNone(ifname)
def test_members(self):
self.assertTrue(bondethernet.is_bond_member(self.cfg, "GigabitEthernet1/0/0"))
self.assertTrue(bondethernet.is_bond_member(self.cfg, "GigabitEthernet1/0/1"))
self.assertFalse(bondethernet.is_bond_member(self.cfg, "GigabitEthernet2/0/0"))
self.assertFalse(bondethernet.is_bond_member(self.cfg, "GigabitEthernet2/0/0.100"))
def test_is_bondethernet(self):
self.assertTrue(bondethernet.is_bondethernet(self.cfg, "BondEthernet0"))
self.assertFalse(bondethernet.is_bondethernet(self.cfg, "BondEthernet-notexist"))
self.assertFalse(bondethernet.is_bondethernet(self.cfg, "GigabitEthernet1/0/0"))
def test_enumerators(self):
ifs = bondethernet.get_bondethernets(self.cfg)
self.assertEqual(len(ifs), 1)
self.assertIn("BondEthernet0", ifs)
self.assertNotIn("BondEthernet-noexist", ifs)
```
Every single function that is defined in the file `config/bondethernet.py` (there are four) will have
an accompanying unittest to ensure it works as expected. And every validator module will have a suite
of unittests fully covering their functionality. In total, I wrote a few dozen unit tests like this,
in an attempt to be reasonably certain that the config validator functionality works as advertised.
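The helpers exercised by these tests could look roughly like this (a simplified sketch of what `config/bondethernet.py` might contain, not the actual code):

```
def get_bondethernets(cfg):
    """Return the list of BondEthernet names in the config."""
    if not cfg or "bondethernets" not in cfg:
        return []
    return list(cfg["bondethernets"].keys())

def get_by_name(cfg, ifname):
    """Return (ifname, iface-dict) for a BondEthernet, or (None, None)."""
    try:
        return ifname, cfg["bondethernets"][ifname]
    except (KeyError, TypeError):
        return None, None

def is_bondethernet(cfg, ifname):
    """True if the given interface name is a BondEthernet."""
    _, iface = get_by_name(cfg, ifname)
    return iface is not None

def is_bond_member(cfg, ifname):
    """True if the given interface is a member of any BondEthernet."""
    return any(ifname in bond.get("interfaces", [])
               for bond in cfg.get("bondethernets", {}).values())
```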
### YAML Testing
I added one additional class of unittest called a ***YAMLTest***. What happens here is that a certain YAML configuration
file, which may be valid or have errors, is offered to the end to end config parser (so both the Yamale schema
validator as well as the semantic validators), and all errors are accounted for. As an example, two sub-interfaces
on the same parent cannot have the same encapsulation, so offering the following file to the config validator
is _expected_ to trip errors:
```
$ cat << EOF > unittest/yaml/error-subinterface1.yaml
test:
description: "Two subinterfaces can't have the same encapsulation"
errors:
expected:
- "sub-interface .*.100 does not have unique encapsulation"
- "sub-interface .*.102 does not have unique encapsulation"
count: 2
---
interfaces:
GigabitEthernet1/0/0:
sub-interfaces:
100:
description: "VLAN 100"
101:
description: "Another VLAN 100, but without exact-match"
encapsulation:
dot1q: 100
102:
        description: "Another VLAN 100, but with exact-match"
encapsulation:
dot1q: 100
exact-match: True
EOF
```
You can see the file here has two YAML documents (separated by `---`), the first one explains to the YAMLTest
class what to expect. There can either be no errors (in which case `test.errors.count=0`), or there can be
specific errors that are expected. In this case, `Gi1/0/0.100` and `Gi1/0/0.102` have the same encapsulation
but `Gi1/0/0.101` is unique (if you're curious, this is because the encapsulation on 100 and 102 has exact-match,
but the one on 101 does _not_).
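This subtlety can be made concrete: following the convention visible above, a sub-interface without an `encapsulation` block behaves as `dot1q <subid>` with exact-match, so a uniqueness check has to normalize each sub-interface first. A sketch (hypothetical helpers, not the actual vppcfg code):

```
def effective_encap(sub_id, sub):
    """Normalize a sub-interface's encapsulation; absent means dot1q <subid> exact-match."""
    encap = sub.get("encapsulation")
    if encap is None:
        return ("dot1q", sub_id, None, True)
    return (
        "dot1ad" if "dot1ad" in encap else "dot1q",
        encap.get("dot1ad", encap.get("dot1q")),
        encap.get("inner-dot1q"),
        encap.get("exact-match", False),
    )

def duplicate_encaps(sub_interfaces):
    """Return groups of sub-ids whose normalized encapsulation collides."""
    seen = {}
    for sub_id, sub in sub_interfaces.items():
        seen.setdefault(effective_encap(sub_id, sub), []).append(sub_id)
    return [ids for ids in seen.values() if len(ids) > 1]
```

Held against the `error-subinterface1.yaml` example above, this flags 100 and 102 but not 101.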
The implementation of this YAMLTest class is in `tests.py`, which in turn runs all YAML tests on the files it
finds in `unittest/yaml/*.yaml` (currently 47 specific cases are tested there, which cover 100% of the
semantic constraints), and regular unittests (currently 42, which is a coincidence, I swear!).
# What's next?
These tests, together, give me a pretty strong assurance that any given YAML file that passes the validator,
is indeed a valid configuration for VPP. In my next post, I'll go one step further, and talk about applying
the configuration to a running VPP instance, which is of course the overarching goal. But I would not want
to mess up my (or your!) VPP router by feeding it garbage, so the lion's share of my time so far on this project
has been to assert the YAML file is both syntactically and semantically valid.
In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:
```
bondethernets:
BondEthernet0:
interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ]
interfaces:
GigabitEthernet3/0/0:
mtu: 9000
description: "LAG #1"
GigabitEthernet3/0/1:
mtu: 9000
description: "LAG #2"
HundredGigabitEthernet12/0/0:
lcp: "ice0"
mtu: 9000
addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ]
sub-interfaces:
1234:
mtu: 1200
lcp: "ice0.1234"
encapsulation:
dot1q: 1234
exact-match: True
1235:
mtu: 1100
lcp: "ice0.1234.1000"
encapsulation:
dot1q: 1234
inner-dot1q: 1000
exact-match: True
HundredGigabitEthernet12/0/1:
mtu: 2000
description: "Bridged"
BondEthernet0:
mtu: 9000
lcp: "be0"
sub-interfaces:
100:
mtu: 2500
l2xc: BondEthernet0.200
encapsulation:
dot1q: 100
exact-match: False
200:
mtu: 2500
l2xc: BondEthernet0.100
encapsulation:
dot1q: 200
exact-match: False
500:
mtu: 2000
encapsulation:
dot1ad: 500
exact-match: False
501:
mtu: 2000
encapsulation:
dot1ad: 501
exact-match: False
vxlan_tunnel1:
mtu: 2000
loopbacks:
loop0:
lcp: "lo0"
addresses: [ 10.0.0.1/32, 2001:db8::1/128 ]
loop1:
lcp: "bvi1"
addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ]
bridgedomains:
bd1:
mtu: 2000
bvi: loop1
interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ]
bd11:
mtu: 1500
vxlan_tunnels:
vxlan_tunnel1:
local: 192.0.2.1
remote: 192.0.2.2
vni: 101
```
The vision for my VPP Configuration utility is that it can move from any existing VPP configuration to any
other (validated successfully) configuration with a minimal amount of steps, and that it will plan its
way declaratively from A to B, ordering the calls to the API safely and quickly. Interested? Good, because
I do expect that a utility like this would be very valuable to serious VPP users!
---
date: "2022-04-02T08:50:19Z"
title: VPP Configuration - Part2
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community
merged the [Linux Control Plane]({% post_url 2021-08-12-vpp-1 %}) plugin. I wrote about its deployment
to both regular servers like the [Supermicro]({% post_url 2021-09-21-vpp-7 %}) routers that run on our
[AS8298]({% post_url 2021-02-27-network %}), as well as virtual machines running in
[KVM/Qemu]({% post_url 2021-12-23-vpp-playground %}).
Now that I've been running VPP in production for about half a year, I can't help but notice one specific
drawback: VPP is a programmable dataplane, and _by design_ it does not include any configuration or
controlplane management stack. It's meant to be integrated into a full stack by operators. For end-users,
this unfortunately means that typing on the CLI won't persist any configuration, and if VPP is restarted,
it will not pick up where it left off. There's one developer convenience in the form of the `exec`
command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line
by line. However, if any typo is made in the file, processing immediately stops. It's meant as a convenience
for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.
Luckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts,
I'll detail the work I've done to create a configuration utility that can take a YAML configuration file,
compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply
the configuration to the dataplane. Welcome to `vppcfg`!
In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
## VPP Config: a DAG
Before we dive into my `vppcfg` code, let me first introduce a mental model of how configuration is built. We
rarely stop and think about it, but when we configure our routers (no matter if it's a Cisco or a Juniper or
a VPP router), in our mind we logically order the operations in a very particular way. To state the obvious,
if I want to create a sub-interface which also has an address, I would create the sub-int _before_ adding the
address, right? Similarly, if I wanted to expose a sub-interface `Hu12/0/0.100` in Linux as a _LIP_, I would
create it only _after_ having created a _LIP_ for the parent interface `Hu12/0/0`, to satisfy Linux's
requirement that all sub-interfaces have a parent interface, like so:
```
vpp# create sub HundredGigabitEthernet12/0/0 100
vpp# set interface ip address HundredGigabitEthernet12/0/0.100 192.0.2.1/29
vpp# lcp create HundredGigabitEthernet12/0/0 host-if ice0
vpp# lcp create HundredGigabitEthernet12/0/0.100 host-if ice0.100
vpp# set interface state HundredGigabitEthernet12/0/0 up
vpp# set interface state HundredGigabitEthernet12/0/0.100 up
```
Of course some of the ordering doesn't strictly matter. For example, I can set the state of
`Hu12/0/0.100` up before adding the address, or after adding the address, or even after adding the
_LIP_, but one thing is certain: I cannot set its state to up before it was created in the first place!
In the other direction, when removing things, it's easy to see that you cannot manipulate the state
of a sub-interface after deleting it, so to cleanly remove the construction above, I would have to
walk the statements back in reverse, like so:
```
vpp# set interface state HundredGigabitEthernet12/0/0.100 down
vpp# set interface state HundredGigabitEthernet12/0/0 down
vpp# lcp delete HundredGigabitEthernet12/0/0.100 host-if ice0.100
vpp# lcp delete HundredGigabitEthernet12/0/0 host-if ice0
vpp# set interface ip address del HundredGigabitEthernet12/0/0.100 192.0.2.1/29
vpp# delete sub HundredGigabitEthernet12/0/0.100
```
Because of this reasonably straightforward ordering, it's possible to construct a graph of
operations that depend on other operations having been completed beforehand. Such a graph is called
a [Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) or _DAG_.
{{< image width="400px" float="left" src="/assets/vppcfg/vppcfg-dag.png" alt="DAG" >}}
First some theory (from Wikipedia): A directed graph is a DAG if and only if it can be topologically
ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions.
DAGs have numerous scientific and computational applications, but the ones I'm mostly interested in here
are dependency mapping and computational scheduling.
A graph is formed by vertices and by edges connecting pairs of vertices, where the vertices are
objects that might exist in VPP (interfaces, bridge-domains, VXLAN tunnels, IP addresses, etc),
and these objects are connected in pairs by edges. In the case of a directed graph, each edge has an
orientation (or direction), from one (source) vertex to another (destination) vertex. A path in a
directed graph is a sequence of edges having the property that the ending vertex of each edge in the
sequence is the same as the starting vertex of the next edge in the sequence; a path forms a cycle
if the starting vertex of its first edge equals the ending vertex of its last edge. A directed acyclic
graph is a directed graph that has no cycles, which in this particular case means that objects'
existence can't rely on other things that ultimately rely back on their own existence.
With that technobabble out of the way: practically speaking, the _edges_ in this graph model
dependencies. Let me give a few examples:
1. The arrow from _Sub Interface_ pointing at _BondEther_ and _Physical Int_ makes the claim that
for the sub-int to exist, it _depends on_ the existence of either a BondEthernet, or a PHY.
1. The arrow from the _BondEther_ to the _Physical Int_, which makes the claim that for the BondEthernet
to work, it must have one or more PHYs in it.
1. There is no arrow between _BondEther_ and _Sub Interface_, which makes the claim that they are
   independent: there is no need for a sub-int to exist in order for a BondEthernet to work.
## VPP Config: Ordering
In my [previous]({% post_url 2022-03-27-vppcfg-1 %}) post, I talked about a bunch of constraints that
make certain YAML configurations invalid (for example, having both _dot1q_ and _dot1ad_ on a sub-interface,
that wouldn't make any sense). Here, I'm going to talk about another type of constraint: ***Temporal
Constraints*** are statements about the ordering of operations. With the example DAG above, I derive the
following constraints:
* A parent interface must exist before a sub-interface can be created on it
* An interface (regardless of sub-int or phy) must exist before an IP address can be added to it
* A _LIP_ can be created on a sub-int only if its parent PHY has a _LIP_
* _LIPs_ must be removed from all sub-interfaces before a PHY's _LIP_ can be removed
* The admin-state of a sub-interface can only be up if its PHY is up
* ... and so on.
But there's a second thing to keep in mind, and this is a bit more specific to the VPP configuration
operations themselves. Sometimes, I may find that an object already exists, say a sub-interface, but
that it has configuration attributes that are not what I wanted. For example, I may have previously
configured a sub-int to be of a certain encapsulation `dot1q 1000 inner-dot1q 1234`, but I changed
my mind and want the sub-int to now be `dot1ad 1000 inner-dot1q 1234` instead. Some attributes of
an interface can be changed on the fly (like the MTU, for example), but some really cannot, and in
my example here, the encapsulation change has to be done another way.
I'll make an obvious but hopefully helpful observation: I can't create the second sub-int with
the same subid, because one already exists (duh). The intuitive way to solve this, of course, is to
delete the old sub-int _first_ and then create a _new_ sub-int with the correct attributes (`dot1ad`
outer encapsulation).
Here's another scenario that illustrates the ordering: Let's say I want to move an IP address
from interface A to interface B. In VPP, I can't configure the same IP address/prefixlen on two
interfaces at the same time, so as with the previous scenario of the encap changing, I will want
to remove the IP address from A before adding it to B.
Come to think of it, there are lots of scenarios where remove-before-add is required:
* If an interface was in bridge-domain A but now wants to be put in bridge-domain B, it'll have
to be _removed_ from the first bridge before being _added_ to the second bridge, because an
interface can't be in two bridges at the same time.
* If an interface was a member of a BondEthernet, but will be moved to be a member of a
bridge-domain now, it will have to be _removed_ from the bond before being _added_ to the
bridge, because an interface can't be both a bondethernet member and a member of a bridge
at the same time.
* And to add to the list, the scenario above: A sub-interface that differs in its intended
encapsulation must be _removed_ before a new one with the same `subid` can be _created_.
All of these cases can be modeled as edges (arrows) between vertices (objects) in the graph
describing the ordering of operations in VPP! I'm now ready to draw two important conclusions:
1. All objects that differ from their intended configuration must be removed before being
added elsewhere, in order to avoid them being referenced/used twice.
1. All objects must be created before their attributes can be set.
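As an illustration of both conclusions, here is a small Python sketch (mine, not vppcfg's actual implementation, and simplified to a single parent per object where the real model is a DAG): creation visits parents before children, and pruning is simply that order reversed:

```python
def create_order(parent_of, objects):
    """Return the objects ordered so that every parent precedes its children."""
    order, seen = [], set()

    def visit(obj):
        if obj in seen:
            return
        seen.add(obj)
        parent = parent_of.get(obj)
        if parent is not None:
            visit(parent)  # ensure the parent is created first
        order.append(obj)

    for obj in objects:
        visit(obj)
    return order

# child -> parent: a QinQ depends on its Dot1Q, which depends on its PHY,
# and an LCP depends on the sub-interface it exposes.
parent_of = {
    "Hu12/0/0.1234": "Hu12/0/0",
    "Hu12/0/0.1235": "Hu12/0/0.1234",
    "lcp ice0.1234": "Hu12/0/0.1234",
}
objects = ["lcp ice0.1234", "Hu12/0/0.1235", "Hu12/0/0.1234"]

creation = create_order(parent_of, objects)  # outer-most (parents) first
pruning = list(reversed(creation))           # inner-most (children) first
```

Note how `Hu12/0/0` lands first in the creation order even though it was only mentioned as a parent, and how the pruning order starts at the deepest object.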
### vppcfg: Path Planning
By thinking about the configuration in this way, I can precisely predict the order of
operations needed to go from any running dataplane configuration to _any new_ target
dataplane configuration. A so-called path planner emerges, which has three main phases of
execution:
1. **Prune** phase (remove objects from VPP that are not in the config)
1. **Create** phase (add objects to VPP that are in the config but not VPP)
1. **Sync** phase (synchronize the attributes of each object in the configuration with VPP)
When removing things, care has to be taken to remove inner-most objects first (first removing
LCP, then QinQ, Dot1Q, BondEthernet, and lastly PHY), because indeed, there exists a dependency
relationship between objects in this DAG. Conversely, when creating objects, the edges flip their
directionality, because creation must be done on outer-most objects first (first creating the
PHY, then BondEthernet, Dot1Q, QinQ and lastly LCP).
For example, QinQ/QinAD sub-interfaces should be removed before their intermediary
Dot1Q/Dot1AD can be removed. As another example, the MTU of parents should grow before that of their
children, while children should shrink before their parents.
**Order matters**.
**Pruning**: First, `vppcfg` will ensure that objects do not have attributes which they should not have (e.g. IP
addresses) and that objects are destroyed that are not needed (i.e. have been removed from the
target config). After this phase, I am certain that any object that exists in the dataplane,
both (a) has the right to exist (because it's in the target configuration), and (b) has the
correct create-time (i.e. non-syncable) attributes.
**Creating**: Next, `vppcfg` will ensure that all objects that are not yet present (including the ones that
it just removed because they were present but had incorrect attributes), get (re)created in the
right order. After this phase, I am certain that _all objects_ in the dataplane now (a) have the
right to exist (because they are in the target configuration), (b) have the correct attributes,
but newly, also that (c) all objects that are in the target configuration also got created and
now exist in the dataplane.
**Syncing**: Finally, all objects are synchronized with the target configuration (IP addresses,
MTU etc), taking care to shrink children before their parents, and growing parents before their
children (this is for the special case of any given sub-interface's MTU having to be equal to or
lower than their parent's MTU).
### vppcfg: Demonstration
I'll create three configurations and let vppcfg path-plan between them. I start a completely
empty VPP dataplane which has two GigabitEthernet and two HundredGigabitEthernet interfaces:
```
pim@hippo:~/src/vpp$ make run
_______ _ _ _____ ___
__/ __/ _ \ (_)__ | | / / _ \/ _ \
_/ _// // / / / _ \ | |/ / ___/ ___/
/_/ /____(_)_/\___/ |___/_/ /_/
DBGvpp# show interface
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet3/0/0 1 down 9000/0/0/0
GigabitEthernet3/0/1 2 down 9000/0/0/0
HundredGigabitEthernet12/0/0 3 down 9000/0/0/0
HundredGigabitEthernet12/0/1 4 down 9000/0/0/0
local0 0 down 0/0/0/0
```
#### Demo 1: First time config (empty VPP)
First, starting simple, I write the following YAML configuration called `hippo4.yaml`. It defines a
few sub-interfaces, a bridgedomain with one QinQ sub-interface `Hu12/0/0.101` in it, and it then
cross-connects `Gi3/0/0.100` with `Hu12/0/1.100`, keeping all sub-interfaces at an MTU of 2000 and
their PHYs at an MTU of 9216:
```
interfaces:
GigabitEthernet3/0/0:
mtu: 9216
sub-interfaces:
100:
mtu: 2000
l2xc: HundredGigabitEthernet12/0/1.100
GigabitEthernet3/0/1:
description: Not Used
HundredGigabitEthernet12/0/0:
mtu: 9216
sub-interfaces:
100:
mtu: 3000
101:
mtu: 2000
encapsulation:
dot1q: 100
inner-dot1q: 200
exact-match: True
HundredGigabitEthernet12/0/1:
mtu: 9216
sub-interfaces:
100:
mtu: 2000
l2xc: GigabitEthernet3/0/0.100
bridgedomains:
bd10:
description: "Bridge Domain 10"
mtu: 2000
interfaces: [ HundredGigabitEthernet12/0/0.101 ]
```
If I offer this config to `vppcfg` and ask it to plan a path, there won't be any **pruning** going on,
because there are no objects in the newly started VPP dataplane that need to be deleted. But I do expect
to see a bunch of sub-interface and one bridge-domain **creation**, followed by **syncing** a bunch of
interfaces with bridge-domain memberships and L2 Cross Connects. Finally, the MTU of the interfaces will
be sync'd to their configured values, and the path is planned like so:
```
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan
[INFO ] root.main: Loading configfile hippo4.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
create sub GigabitEthernet3/0/0 100 dot1q 100 exact-match
create sub HundredGigabitEthernet12/0/0 100 dot1q 100 exact-match
create sub HundredGigabitEthernet12/0/1 100 dot1q 100 exact-match
create sub HundredGigabitEthernet12/0/0 101 dot1q 100 inner-dot1q 200 exact-match
create bridge-domain 10
set interface l2 bridge HundredGigabitEthernet12/0/0.101 10
set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 pop 2
set interface l2 xconnect GigabitEthernet3/0/0.100 HundredGigabitEthernet12/0/1.100
set interface l2 tag-rewrite GigabitEthernet3/0/0.100 pop 1
set interface l2 xconnect HundredGigabitEthernet12/0/1.100 GigabitEthernet3/0/0.100
set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 pop 1
set interface mtu 9216 GigabitEthernet3/0/0
set interface mtu 9216 HundredGigabitEthernet12/0/0
set interface mtu 9216 HundredGigabitEthernet12/0/1
set interface mtu packet 1500 GigabitEthernet3/0/1
set interface mtu packet 9216 GigabitEthernet3/0/0
set interface mtu packet 9216 HundredGigabitEthernet12/0/0
set interface mtu packet 9216 HundredGigabitEthernet12/0/1
set interface mtu packet 2000 GigabitEthernet3/0/0.100
set interface mtu packet 3000 HundredGigabitEthernet12/0/0.100
set interface mtu packet 2000 HundredGigabitEthernet12/0/1.100
set interface mtu packet 2000 HundredGigabitEthernet12/0/0.101
set interface mtu 1500 GigabitEthernet3/0/1
set interface state GigabitEthernet3/0/0 up
set interface state GigabitEthernet3/0/0.100 up
set interface state GigabitEthernet3/0/1 up
set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/0.100 up
set interface state HundredGigabitEthernet12/0/0.101 up
set interface state HundredGigabitEthernet12/0/1 up
set interface state HundredGigabitEthernet12/0/1.100 up
[INFO ] root.main: Planning succeeded
```
On the `vppctl` commandline, I can simply cut-and-paste these CLI commands and the dataplane ends up
configured exactly like was desired in the `hippo4.yaml` configuration file. One nice way to tell if
the reconciliation of the config file into the running VPP instance was successful is by running the
planner again with the same YAML config file. It should not find anything worth pruning, creating nor
syncing, and indeed:
```
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan
[INFO ] root.main: Loading configfile hippo4.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
[INFO ] root.main: Planning succeeded
```
#### Demo 2: Moving from one config to another
To demonstrate how my reconciliation algorithm works in practice, I decide to invent a radically
different configuration for Hippo, called `hippo12.yaml`, in which a new BondEthernet appears,
two of its sub-interfaces are cross connected, `Hu12/0/0` now gets a _LIP_ and some IP addresses, and
the bridge-domain `bd10` is replaced by two others, `bd1` and `bd11`, the former of which also sports
a BVI (with a _LIP_ called `bvi1`) and a VXLAN Tunnel bridged into `bd1` for good measure:
```
bondethernets:
BondEthernet0:
interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ]
interfaces:
GigabitEthernet3/0/0:
mtu: 9000
description: "LAG #1"
GigabitEthernet3/0/1:
mtu: 9000
description: "LAG #2"
HundredGigabitEthernet12/0/0:
lcp: "ice12-0-0"
mtu: 9000
addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ]
sub-interfaces:
1234:
mtu: 1200
lcp: "ice0.1234"
encapsulation:
dot1q: 1234
exact-match: True
1235:
mtu: 1100
lcp: "ice0.1234.1000"
encapsulation:
dot1q: 1234
inner-dot1q: 1000
exact-match: True
HundredGigabitEthernet12/0/1:
mtu: 2000
description: "Bridged"
BondEthernet0:
mtu: 9000
lcp: "bond0"
sub-interfaces:
10:
lcp: "bond0.10"
mtu: 3000
100:
mtu: 2500
l2xc: BondEthernet0.200
encapsulation:
dot1q: 100
exact-match: False
200:
mtu: 2500
l2xc: BondEthernet0.100
encapsulation:
dot1q: 200
exact-match: False
500:
mtu: 2000
encapsulation:
dot1ad: 500
exact-match: False
501:
mtu: 2000
encapsulation:
dot1ad: 501
exact-match: False
vxlan_tunnel1:
mtu: 2000
loopbacks:
loop0:
lcp: "lo0"
addresses: [ 10.0.0.1/32, 2001:db8::1/128 ]
loop1:
lcp: "bvi1"
addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ]
bridgedomains:
bd1:
mtu: 2000
bvi: loop1
interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ]
bd11:
mtu: 1500
vxlan_tunnels:
vxlan_tunnel1:
local: 192.0.2.1
remote: 192.0.2.2
vni: 101
```
Before I wrote `vppcfg`, moving from `hippo4.yaml` to this radically different `hippo12.yaml`
would have been a nightmare, and would almost certainly have caused me to miss a step and cause an outage. But, due to
the fundamental understanding of ordering, and the methodical execution of **pruning**, **creating** and
**syncing** the objects, the path planner comes up with the following sequence, which I'll break down
into its three constituent phases:
```
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan
[INFO ] root.main: Loading configfile hippo12.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
set interface state HundredGigabitEthernet12/0/0.101 down
set interface state GigabitEthernet3/0/0.100 down
set interface state HundredGigabitEthernet12/0/0.100 down
set interface state HundredGigabitEthernet12/0/1.100 down
set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 disable
set interface l3 HundredGigabitEthernet12/0/0.101
create bridge-domain 10 del
set interface l2 tag-rewrite GigabitEthernet3/0/0.100 disable
set interface l3 GigabitEthernet3/0/0.100
set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 disable
set interface l3 HundredGigabitEthernet12/0/1.100
delete sub HundredGigabitEthernet12/0/0.101
delete sub GigabitEthernet3/0/0.100
delete sub HundredGigabitEthernet12/0/0.100
delete sub HundredGigabitEthernet12/0/1.100
```
First, `vppcfg` concludes that `Hu12/0/0.101`, `Hu12/0/1.100` and `Gi3/0/0.100` are no longer
needed, so it sets them all admin-state down. The bridge-domain `bd10` no longer has the right to
exist, the poor thing. But before it is deleted, the interface that was in `bd10` can be pruned
(membership _depends_ on the bridge, so in pruning, dependencies are removed before dependents).
Considering `Hu12/0/1.100` and `Gi3/0/0.100` were an L2XC pair before, they are returned to default
(L3) mode and, because it's no longer needed, the [VLAN Gymnastics]({% post_url 2022-02-14-vpp-vlan-gym %})
tag rewriting is also cleaned up for both interfaces. Finally, the sub-interfaces that do not appear
in the target configuration are deleted, completing the **pruning** phase.
It then continues with the **create** phase:
```
create loopback interface instance 0
create loopback interface instance 1
create bond mode lacp load-balance l34 id 0
create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 instance 1 vni 101 decap-next l2
create sub HundredGigabitEthernet12/0/0 1234 dot1q 1234 exact-match
create sub BondEthernet0 10 dot1q 10 exact-match
create sub BondEthernet0 100 dot1q 100
create sub BondEthernet0 200 dot1q 200
create sub BondEthernet0 500 dot1ad 500
create sub BondEthernet0 501 dot1ad 501
create sub HundredGigabitEthernet12/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
create bridge-domain 1
create bridge-domain 11
lcp create HundredGigabitEthernet12/0/0 host-if ice12-0-0
lcp create BondEthernet0 host-if bond0
lcp create loop0 host-if lo0
lcp create loop1 host-if bvi1
lcp create HundredGigabitEthernet12/0/0.1234 host-if ice0.1234
lcp create BondEthernet0.10 host-if bond0.10
lcp create HundredGigabitEthernet12/0/0.1235 host-if ice0.1234.1000
```
Here, interfaces are created in order of loopbacks first, then BondEthernets, then Tunnels, and
finally sub-interfaces, first creating single-tagged and then creating dual-tagged sub-interfaces.
Of course, the BondEthernet has to be created before any sub-int will be able to be created on it.
Note that the QinQ `Hu12/0/0.1235` will be created after its intermediary parent `Hu12/0/0.1234`
due to this ordering requirement.
Then, the two new bridgedomains `bd1` and `bd11` are created, and finally the _LIP_ plumbing is
performed, starting with the PHY `ice12-0-0` and BondEthernet `bond0`, then the two loopbacks,
and only then advancing to the two single-tag dot1q interfaces and finally the QinQ interface. For
LCPs, this is very important, because in Linux, the interfaces are a tree, not a list. `ice12-0-0`
must be created before its child `ice0.1234@ice12-0-0` can be created, and only then can the QinQ
`ice0.1234.1000@ice0.1234` be created. This creation order follows from the DAG having an edge
signalling an LCP depending on the sub-interface, and an edge between the sub-interface with two
tags depending on the sub-interface with one tag, and an edge between the single-tagged sub-interface
depending on its PHY.
After all this work, `vppcfg` can assert (a) every object that now exists in VPP is in the
target configuration and (b) that any object that exists in the configuration also is present in
VPP (with the correct attributes).
But there's one last thing to do, and that's to ensure that the attributes that can be changed at
runtime (IP addresses, L2XCs, BondEthernet and bridge-domain members, etc.) are **sync'd** into
their respective objects in VPP, based on what's in the target configuration:
```
bond add BondEthernet0 GigabitEthernet3/0/0
bond add BondEthernet0 GigabitEthernet3/0/1
comment { ip link set bond0 address 00:25:90:0c:05:01 }
set interface l2 bridge loop1 1 bvi
set interface l2 bridge BondEthernet0.500 1
set interface l2 tag-rewrite BondEthernet0.500 pop 1
set interface l2 bridge BondEthernet0.501 1
set interface l2 tag-rewrite BondEthernet0.501 pop 1
set interface l2 bridge HundredGigabitEthernet12/0/1 1
set interface l2 tag-rewrite HundredGigabitEthernet12/0/1 disable
set interface l2 bridge vxlan_tunnel1 1
set interface l2 tag-rewrite vxlan_tunnel1 disable
set interface l2 xconnect BondEthernet0.100 BondEthernet0.200
set interface l2 tag-rewrite BondEthernet0.100 pop 1
set interface l2 xconnect BondEthernet0.200 BondEthernet0.100
set interface l2 tag-rewrite BondEthernet0.200 pop 1
set interface state GigabitEthernet3/0/1 down
set interface mtu 9000 GigabitEthernet3/0/1
set interface state GigabitEthernet3/0/1 up
set interface mtu packet 9000 GigabitEthernet3/0/0
set interface mtu packet 9000 HundredGigabitEthernet12/0/0
set interface mtu packet 2000 HundredGigabitEthernet12/0/1
set interface mtu packet 2000 vxlan_tunnel1
set interface mtu packet 1500 loop0
set interface mtu packet 1500 loop1
set interface mtu packet 9000 GigabitEthernet3/0/1
set interface mtu packet 1200 HundredGigabitEthernet12/0/0.1234
set interface mtu packet 3000 BondEthernet0.10
set interface mtu packet 2500 BondEthernet0.100
set interface mtu packet 2500 BondEthernet0.200
set interface mtu packet 2000 BondEthernet0.500
set interface mtu packet 2000 BondEthernet0.501
set interface mtu packet 1100 HundredGigabitEthernet12/0/0.1235
set interface state GigabitEthernet3/0/0 down
set interface mtu 9000 GigabitEthernet3/0/0
set interface state GigabitEthernet3/0/0 up
set interface state HundredGigabitEthernet12/0/0 down
set interface mtu 9000 HundredGigabitEthernet12/0/0
set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/1 down
set interface mtu 2000 HundredGigabitEthernet12/0/1
set interface state HundredGigabitEthernet12/0/1 up
set interface ip address HundredGigabitEthernet12/0/0 192.0.2.17/30
set interface ip address HundredGigabitEthernet12/0/0 2001:db8:3::1/64
set interface ip address loop0 10.0.0.1/32
set interface ip address loop0 2001:db8::1/128
set interface ip address loop1 10.0.1.1/24
set interface ip address loop1 2001:db8:1::1/64
set interface state HundredGigabitEthernet12/0/0.1234 up
set interface state HundredGigabitEthernet12/0/0.1235 up
set interface state BondEthernet0 up
set interface state BondEthernet0.10 up
set interface state BondEthernet0.100 up
set interface state BondEthernet0.200 up
set interface state BondEthernet0.500 up
set interface state BondEthernet0.501 up
set interface state vxlan_tunnel1 up
set interface state loop0 up
set interface state loop1 up
```
I'm not gonna lie, it's a tonne of work, but it's all a pretty straightforward juggle. The sync
phase will look at each object in the config and ensure that the attributes that same object has in the
dataplane are present and correct. In my demo, `hippo12.yaml` creates a lot of interfaces and IP
addresses, and changes the MTU of pretty much every interface, but in order:
* The bondethernet gets its members `Gi3/0/0` and `Gi3/0/1`. As an interesting aside, when VPP
creates a BondEthernet it'll initially assign it an ephemeral MAC address. Then, when its first
member is added, the MAC address of the BondEthernet will change to that of the first member.
The comment reminds me to also set this MAC on the Linux device `bond0`. In the future, I'll add
some `PyRoute2` code to do that automatically.
* BridgeDomains are next. The BVI `loop1` is added first, then a few sub-interfaces and a tunnel,
and VLAN tag-rewriting for tagged interfaces is configured. There are two bridges, but only one
of them has members, so there's not much (in fact, there's nothing) to do for the other one.
* L2 Cross Connects can be changed at runtime, and they're next. The two interfaces `BE0.100` and
`BE0.200` are connected to one another and tag-rewrites are set up for them, considering they
are both tagged sub-interfaces.
* MTU is next. There are two variants of this. The first one, `set interface mtu`, is actually a
change in the DPDK driver to change the maximum allowable frame size. For this change, some
interface types have to be brought down first, the max frame size changed, and then brought back
up again. For all the others, the MTU will be changed in a specific order:
1. PHYs will grow their MTU first, as growing a PHY is guaranteed to be always safe.
1. Sub-interfaces will shrink QinX first, then Dot1Q/Dot1AD, then untagged interfaces. This is
to ensure we do not leave VPP and LinuxCP in a state where a QinQ sub-int has a higher MTU
than any of its parents.
1. Sub-interfaces will grow untagged first, then Dot1Q/Dot1AD, and finally QinX sub-interfaces.
Same reason as step 2, no sub-interface will end up with a higher MTU than any of its
parents.
1. PHYs will shrink their MTU last. The YAML configuration validation asserts that no PHY can
have an MTU lower than any of its children, so this is safe.
* Finally, IP addresses are added to `Hu12/0/0`, `loop0` and `loop1`. I can guarantee that adding
IP addresses will not clash with any other interface, because pruning would've removed IP
addresses from interfaces where they don't belong previously.
* And to finish off, the admin state for interfaces is set, again going from PHY, Bond, Tunnel,
1-tagged sub-interfaces and finally 2-tagged sub-interfaces and loopbacks.
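That four-step MTU ordering lends itself to a small sketch. Assuming (my simplification, not vppcfg's actual code) each change is tagged with its interface's tag depth, 0 for a PHY, 1 for Dot1Q/Dot1AD, 2 for QinX:

```python
def mtu_change_order(changes):
    """changes: list of (ifname, tag_depth, current_mtu, target_mtu).
    Returns names in a safe order: grow PHYs, shrink sub-ints (deepest
    first), grow sub-ints (shallowest first), and shrink PHYs last."""
    phy_grow   = [c for c in changes if c[1] == 0 and c[3] > c[2]]
    sub_shrink = sorted((c for c in changes if c[1] > 0 and c[3] < c[2]),
                        key=lambda c: -c[1])  # QinX before Dot1Q/Dot1AD
    sub_grow   = sorted((c for c in changes if c[1] > 0 and c[3] > c[2]),
                        key=lambda c: c[1])   # Dot1Q/Dot1AD before QinX
    phy_shrink = [c for c in changes if c[1] == 0 and c[3] < c[2]]
    return [c[0] for c in phy_grow + sub_shrink + sub_grow + phy_shrink]

changes = [
    ("HundredGigabitEthernet12/0/0",      0, 1500, 9000),  # PHY grows first
    ("HundredGigabitEthernet12/0/0.1234", 1, 9000, 1200),  # Dot1Q shrinks
    ("HundredGigabitEthernet12/0/0.1235", 2, 9000, 1100),  # QinQ shrinks earlier
    ("GigabitEthernet3/0/1",              0, 9216, 9000),  # PHY shrinks last
]
print(mtu_change_order(changes))
```

At every intermediate step of the returned order, no sub-interface's MTU exceeds that of its parent, which is exactly the invariant described above.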
Let's put it to the test:
```
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan -o hippo4-to-12.exec
[INFO ] root.main: Loading configfile hippo12.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
[INFO ] vppcfg.reconciler.write: Wrote 94 lines to hippo4-to-12.exec
[INFO ] root.main: Planning succeeded
pim@hippo:~/src/vppcfg$ vppctl exec ~/src/vppcfg/hippo4-to-12.exec
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan
[INFO ] root.main: Loading configfile hippo12.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
[INFO ] root.main: Planning succeeded
```
Notice that after applying `hippo4-to-12.exec`, the planner had nothing else to say. VPP is now in
the target configuration state, slick!
#### Demo 3: Returning VPP to empty
This one is easy, but shows the pruning in action. Let's say I wanted to return VPP to a default
configuration without any objects, and its interfaces all at MTU 1500:
```
interfaces:
GigabitEthernet3/0/0:
mtu: 1500
description: Not Used
GigabitEthernet3/0/1:
mtu: 1500
description: Not Used
HundredGigabitEthernet12/0/0:
mtu: 1500
description: Not Used
HundredGigabitEthernet12/0/1:
mtu: 1500
description: Not Used
```
Simply applying that plan:
```
pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo-empty.yaml plan -o 12-to-empty.exec
[INFO ] root.main: Loading configfile hippo-empty.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
[INFO ] vppcfg.reconciler.write: Wrote 66 lines to 12-to-empty.exec
[INFO ] root.main: Planning succeeded
pim@hippo:~/src/vppcfg$ vppctl
vpp# exec ~/src/vppcfg/12-to-empty.exec
vpp# show interface
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet3/0/0 1 up 1500/0/0/0
GigabitEthernet3/0/1 2 up 1500/0/0/0
HundredGigabitEthernet12/0/0 3 up 1500/0/0/0
HundredGigabitEthernet12/0/1 4 up 1500/0/0/0
local0 0 down 0/0/0/0
```
### Final notes
Now you may have been wondering why I would call the first file `hippo4.yaml` and the second one
`hippo12.yaml`. This is because I have 20 such YAML files that bring Hippo into all sorts of
esoteric configuration states, and I do this so that I can do a full integration test of any config
morphing into any other config:
```
for i in hippo[0-9]*.yaml; do
echo "Clearing: Moving to hippo-empty.yaml"
./vppcfg -c hippo-empty.yaml > /tmp/vppcfg-exec-empty
[ -s /tmp/vppcfg-exec-empty ] && vppctl exec /tmp/vppcfg-exec-empty
for j in hippo[0-9]*.yaml; do
echo " - Moving to $i .. "
./vppcfg -c $i > /tmp/vppcfg-exec_$i
[ -s /tmp/vppcfg-exec_$i ] && vppctl exec /tmp/vppcfg-exec_$i
echo " - Moving from $i to $j"
./vppcfg -c $j > /tmp/vppcfg-exec_${i}_${j}
[ -s /tmp/vppcfg-exec_${i}_${j} ] && vppctl exec /tmp/vppcfg-exec_${i}_${j}
echo " - Checking that from $j to $j is empty"
./vppcfg -c $j > /tmp/vppcfg-exec_${j}_${j}_null
done
done
```
What this does is start Hippo off with an empty config, then move it to `hippo1.yaml`, and from
there move the configuration to each other YAML file and back to `hippo1.yaml`. Doing this proves
that, no matter which configuration I want to obtain, I can get there safely when the VPP dataplane
config starts out looking like what is described in `hippo1.yaml`. I'll then move it back to empty
and into `hippo2.yaml`, doing the whole cycle again. So for 20 files, this means ~400 or so
configuration transitions. Some of these are special: notably, moving from `hippoN.yaml` to
the same `hippoN.yaml` should result in zero diffs.
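The size of that test matrix is easy to sanity-check (a throwaway sketch, not part of vppcfg):

```python
from itertools import product

configs = [f"hippo{i}.yaml" for i in range(1, 21)]  # 20 config files

# Every ordered (source, target) pair is one transition to exercise ...
transitions = list(product(configs, configs))       # 20 * 20 = 400
# ... and the diagonal (hippoN -> hippoN) must plan zero changes.
diagonal = [(a, b) for a, b in transitions if a == b]
print(len(transitions), len(diagonal))  # 400 20
```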
With this path planner reasonably well tested, I have pretty high confidence that `vppcfg` can
change the dataplane from any existing configuration to any desired target configuration.
## What's next
One thing that I didn't mention yet is that the `vppcfg` path planner works by reading the API
configuration state exactly once (at startup), and then it figures out the CLI calls to print
without needing to talk to VPP again. This is super useful as it's a non-intrusive way to inspect
the changes before applying them, and it's a property I'd like to carry forward.
However, I don't necessarily think that emitting the CLI statements is the best user experience;
they are mostly useful for analysis. What I really want to do is emit
API calls after the plan is created and reviewed/approved, directly reprogramming the VPP dataplane,
and likely the Linux network namespace interfaces as well, for example setting the MAC address of
a BondEthernet as I showed in that one comment above, or setting interface alias names based on
the configured descriptions.
However, the VPP API set needed to do this is not 100% baked yet. For example, I observed crashes
when tinkering with BVIs and Loopbacks ([thread](https://lists.fd.io/g/vpp-dev/message/21116)), and
fixed a few obvious errors in the Linux CP API ([gerrit](https://gerrit.fd.io/r/c/vpp/+/35479)) but
there are still a few more issues to work through before I can set the next step with `vppcfg`.
But for now, it's already helping me out tremendously at IPng Networks and I hope it'll be useful
for others, too.
---
date: "2022-10-14T19:52:11Z"
title: VPP Lab - Setup
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# Introduction
In a previous post ([VPP Linux CP - Virtual Machine Playground]({% post_url 2021-12-23-vpp-playground %})), I
wrote a bit about building a QEMU image so that folks can play with the [Vector Packet Processor](https://fd.io)
and the Linux Control Plane code. Judging by our access logs, this image has definitely been downloaded a bunch,
and I myself use it regularly when I want to tinker a little bit, without wanting to impact the production
routers at [AS8298]({% post_url 2021-02-27-network %}).
The topology of my tests has become a bit more complicated over time, and often just one router would not be
enough. Yet, repeatability is quite important, and I found myself constantly reinstalling / recheckpointing
the `vpp-proto` virtual machine I was using. I got my hands on some LAB hardware, so it's time for an upgrade!
## IPng Networks LAB - Physical
{{< image width="300px" float="left" src="/assets/lab/physical.png" alt="Physical" >}}
First, I specc'd out a few machines that will serve as hypervisors. From top to bottom in the picture here, two
FS.com S5680-20SQ switches -- I reviewed these earlier [[ref]({% post_url 2021-08-07-fs-switch %})], and I really
like these, as they come with 20x10G, 4x25G and 2x40G ports, an OOB management port and serial to configure them.
Under it, is its larger brother, with 48x10G and 8x100G ports, the FS.com S5860-48SC. Although it's a bit more
expensive, it's also necessary because I often test VPP at higher bandwidth, and as such being able to make
ethernet topologies by mixing 10, 25, 40, 100G is super useful for me. So, this switch is `fsw0.lab.ipng.ch`
and dedicated to lab experiments.
Connected to the switch are my trusty `Rhino` and `Hippo` machines. If you remember that game _Hungry Hungry Hippos_
that's where the name comes from. They are both Ryzen 5950X machines on ASUS B550 motherboards, each with 2x1G i350 copper
NICs (pictured here not connected), and 2x100G i810 QSFP network cards (properly slotted in the motherboard's
PCIe v4.0 x16 slot).
Finally, three Dell R720XD machines serve as the VPP testbed to be built. They each come with 128GB of RAM, 2x500G
SSDs, two Intel 82599ES dual 10G NICs (four ports total), and four Broadcom BCM5720 1G NICs. The first 1G port is
connected to a management switch, and it doubles up as an IPMI speaker, so I can turn on/off the hypervisors
remotely. All four 10G ports are connected with DACs to `fsw0-lab`, as are two 1G copper ports (the blue UTP
cables). Everything can be turned on/off remotely, which is useful for noise, heat and overall the environment 🍀.
## IPng Networks LAB - Logical
{{< image width="200px" float="right" src="/assets/lab/logical.svg" alt="Logical" >}}
I have three of these Dell R720XD machines in the lab, and each one of them will run one complete lab environment,
consisting of four VPP virtual machines, network plumbing, and uplink. That way, I can turn on one hypervisor,
say `hvn0.lab.ipng.ch`, prepare and boot the VMs, mess around with it, and when I'm done, return the VMs to a
pristine state, and turn off the hypervisor. And, because I have three of these machines, I can run three separate
LABs at the same time, or one really big one spanning all the machines. Pictured on the right is a logical sketch
of one of the LABs (LAB id=0), with a bunch of VPP virtual machines, each with four NICs daisychained together, and
a few NICs left over for experimenting.
### Headend
At the top of the logical environment, I am going to be using one of our production machines (`hvn0.chbtl0.ipng.ch`),
which will run a permanent LAB _headend_: a Debian VM called `lab.ipng.ch`. This allows me to hermetically
seal the LAB environments, letting me run them entirely in RFC1918 space. By forcing the LABs to be connected
under this machine, I can ensure that no unwanted traffic enters or exits the network [imagine a loadtest at
100Gbit accidentally leaking; this may or totally may not have once happened to me before ...].
### Disk images
On this production hypervisor (`hvn0.chbtl0.ipng.ch`), I'll also prepare and maintain a prototype `vpp-proto` disk
image, which will serve as a consistent image to boot the LAB virtual machines. This _main_ image will be replicated
over the network into all three `hvn0 - hvn2` hypervisor machines. This way, I can do periodical maintenance on the
_main_ `vpp-proto` image, snapshot it, publish it as a QCOW2 for downloading (see my [[VPP Linux CP - Virtual Machine
Playground]({% post_url 2021-12-23-vpp-playground %})] post for details on how it's built and what you can do with it
yourself!). The snapshots will then also be sync'd to all hypervisors, and from there I can use simple ZFS filesystem
_cloning_ and _snapshotting_ to maintain the LAB virtual machines.
### Networking
Each hypervisor will get an install of [Open vSwitch](https://openvswitch.org/), a production quality, multilayer virtual switch designed to
enable massive network automation through programmatic extension, while still supporting standard management interfaces
and protocols. This takes lots of the guesswork and tinkering out of Linux bridges in KVM/QEMU, and it's a perfect fit
due to its tight integration with `libvirt` (the thing most of us use in Debian/Ubuntu hypervisors). If need be, I can
add one or more of the 1G or 10G ports as well to the OVS fabric, to build more complicated topologies. And, because
the OVS infrastructure and libvirt both allow themselves to be configured over the network, I can control all aspects
of the runtime directly from the `lab.ipng.ch` headend, not having to log in to the hypervisor machines at all. Slick!
# Implementation Details
I start with image management. On the production hypervisor, I create a 6GB ZFS dataset that will serve as my `vpp-proto`
machine, and install it using the exact same method as the playground [[ref]({% post_url 2021-12-23-vpp-playground %})].
Once I have it the way I like it, I'll power off the VM, and see to this image being replicated to all hypervisors.
## ZFS Replication
Enter [zrepl](https://zrepl.github.io/), a one-stop, integrated solution for ZFS replication. This tool is incredibly
powerful, and can do snapshot management, sourcing / sinking replication, of course using incremental snapshots as they
are native to ZFS. Because this is a LAB article, not a zrepl tutorial, I'll just cut to the chase and show the
configuration I came up with.
```
pim@hvn0-chbtl0:~$ cat << EOF | sudo tee /etc/zrepl/zrepl.yml
global:
logging:
# use syslog instead of stdout because it makes journald happy
- type: syslog
format: human
level: warn
jobs:
- name: snap-vpp-proto
type: snap
filesystems:
'ssd-vol0/vpp-proto-disk0<': true
snapshotting:
type: manual
pruning:
keep:
- type: last_n
count: 10
- name: source-vpp-proto
type: source
serve:
type: stdinserver
client_identities:
- "hvn0-lab"
- "hvn1-lab"
- "hvn2-lab"
filesystems:
'ssd-vol0/vpp-proto-disk0<': true # all filesystems
snapshotting:
type: manual
EOF
pim@hvn0-chbtl0:~$ cat << EOF | sudo tee -a /root/.ssh/authorized_keys
# ZFS Replication Clients for IPng Networks LAB
command="zrepl stdinserver hvn0-lab",restrict ecdsa-sha2-nistp256 <omitted> root@hvn0.lab.ipng.ch
command="zrepl stdinserver hvn1-lab",restrict ecdsa-sha2-nistp256 <omitted> root@hvn1.lab.ipng.ch
command="zrepl stdinserver hvn2-lab",restrict ecdsa-sha2-nistp256 <omitted> root@hvn2.lab.ipng.ch
EOF
```
To unpack this, there are two jobs configured in **zrepl**:
* `snap-vpp-proto` - the purpose of this job is to track snapshots as they are created. Normally, zrepl is configured
to automatically make snapshots every hour and copy them out, but in my case, I only want to take snapshots when I've changed
and released the `vpp-proto` image, not periodically. So, I set the snapshotting to manual, and let the system keep the last
ten images.
* `source-vpp-proto` - this is a source job that uses a _lazy_ (albeit fine in this lab environment) method to serve the
snapshots to clients. The SSH keys are added to the _authorized_keys_ file, but restricted so that they can execute
only the `zrepl stdinserver` command and nothing else (ie. these keys cannot log in to the machine). When a server
presents one of these keys, I can map it to a **zrepl client** (for example, `hvn0-lab` for the SSH key presented by
hostname `hvn0.lab.ipng.ch`). The source job then knows to serve the listed filesystems (and their dataset children, denoted by
the `<` suffix) to those clients.
For the client side, each of the hypervisors gets only one job, called a _pull_ job, which will periodically wake up (every
minute) and ensure that any pending snapshots and their incrementals from the remote _source_ are slurped in and replicated
to a _root_fs_ dataset; in this case I called it `ssd-vol0/hvn0.chbtl0.ipng.ch` so I can track where the datasets come from.
```
pim@hvn0-lab:~$ sudo ssh-keygen -t ecdsa -f /etc/zrepl/ssh/identity -C "root@$(hostname -f)"
pim@hvn0-lab:~$ cat << EOF | sudo tee /etc/zrepl/zrepl.yml
global:
logging:
# use syslog instead of stdout because it makes journald happy
- type: syslog
format: human
level: warn
jobs:
- name: vpp-proto
type: pull
connect:
type: ssh+stdinserver
host: hvn0.chbtl0.ipng.ch
user: root
port: 22
identity_file: /etc/zrepl/ssh/identity
root_fs: ssd-vol0/hvn0.chbtl0.ipng.ch
interval: 1m
pruning:
keep_sender:
- type: regex
regex: '.*'
keep_receiver:
- type: last_n
count: 10
recv:
placeholder:
encryption: off
EOF
```
After restarting zrepl for each of the machines (the _source_ machine and the three _pull_ machines), I can now do the
following cool hat trick:
```
pim@hvn0-chbtl0:~$ virsh start --console vpp-proto
## Do whatever maintenance, and then poweroff the VM
pim@hvn0-chbtl0:~$ sudo zfs snapshot ssd-vol0/vpp-proto-disk0@20221019-release
pim@hvn0-chbtl0:~$ sudo zrepl signal wakeup source-vpp-proto
```
This signals the zrepl daemon to re-read the snapshots, which will pick up the newest one, and then without me doing
much of anything else:
```
pim@hvn0-lab:~$ sudo zfs list -t all | grep vpp-proto
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G -
```
That last snapshot was automatically replicated to all hypervisors! If they're turned off, no worries: as soon as they
start up, their local **zrepl** will make its next minutely poll, and pull in all snapshots, bringing the machine up
to date. So even when the hypervisors are normally turned off, this is zero-touch and maintenance free.
## VM image maintenance
Now that I have a stable image to work off of, all I have to do is `zfs clone` this image into new per-VM datasets,
after which I can mess around on the VMs all I want, and when I'm done, I can `zfs destroy` the clone and bring it
back to normal. However, I clearly don't want one and the same clone for each of the VMs, as they do have lots of
config files that are specific to that one _instance_. For example, the mgmt IPv4/IPv6 addresses are unique, and
the VPP and Bird/FRR configs are unique as well. But how unique are they, really?
Enter Jinja (known mostly from Ansible). I decide to make some form of per-VM config files that are generated based
on some templates. That way, I can clone the base ZFS dataset, copy in the deltas, and boot that instead. And to
be extra efficient, I can also make a per-VM `zfs snapshot` of the cloned+updated filesystem, before tinkering with
the VMs, which I'll call a `pristine` snapshot. Still with me?
1. First, clone the base dataset into a per-VM dataset, say `ssd-vol0/vpp0-0`
1. Then, generate a bunch of override files, copying them into the per-VM dataset `ssd-vol0/vpp0-0`
1. Finally, create a snapshot of that, called `ssd-vol0/vpp0-0@pristine` and boot off of that.
Now, returning the VM to a pristine state is simply a matter of shutting down the VM, performing a `zfs rollback`
to the `pristine` snapshot, and starting the VM again. Ready? Let's go!
### Generator
So off I go, writing a small Python generator that uses Jinja to read a bunch of YAML files, merging them along
the way, and then traversing a set of directories with template files and per-VM overrides, to assemble a build
output directory with a fully formed set of files that I can copy into the per-VM dataset.
Take a look at this as a minimally viable configuration:
```
pim@lab:~/src/lab$ cat config/common/generic.yaml
overlays:
default:
path: overlays/bird/
build: build/default/
lab:
mgmt:
ipv4: 192.168.1.80/24
ipv6: 2001:678:d78:101::80/64
gw4: 192.168.1.252
gw6: 2001:678:d78:101::1
nameserver:
search: [ "lab.ipng.ch", "ipng.ch", "rfc1918.ipng.nl", "ipng.nl" ]
nodes: 4
pim@lab:~/src/lab$ cat config/hvn0.lab.ipng.ch.yaml
lab:
id: 0
ipv4: 192.168.10.0/24
ipv6: 2001:678:d78:200::/60
nameserver:
addresses: [ 192.168.10.4, 2001:678:d78:201::ffff ]
hypervisor: hvn0.lab.ipng.ch
```
Here I define a common config file with fields and attributes which will apply to all LAB environments, things
such as the mgmt network, nameserver search paths, and how many VPP virtual machine nodes I want to build. Then,
for `hvn0.lab.ipng.ch`, I specify an IPv4 and IPv6 prefix assigned to it, some specific nameserver endpoints
that will point at an `unbound` running on `lab.ipng.ch` itself.
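The generator itself isn't published in this post, but its core merge step (the per-lab YAML overriding the common YAML) can be sketched in a few lines of plain Python. The function and variable names below are illustrative, not the actual generator code, and I've abbreviated the config dicts:

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge two config dicts; values from `override` win.
    A sketch of the generator's merge step, not the actual code."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # recurse into nested maps
        else:
            out[key] = value                   # scalars and lists: override wins
    return out

# Abbreviated stand-ins for config/common/generic.yaml and
# config/hvn0.lab.ipng.ch.yaml, as they would look after YAML loading:
generic = {"lab": {"mgmt": {"ipv4": "192.168.1.80/24"}, "nodes": 4}}
hvn0 = {"lab": {"id": 0, "ipv4": "192.168.10.0/24"}}

config = merge(generic, hvn0)
print(config["lab"])
# {'mgmt': {'ipv4': '192.168.1.80/24'}, 'nodes': 4, 'id': 0, 'ipv4': '192.168.10.0/24'}
```

The merged `config` dict is then handed to the template renderer, so a per-lab file only needs to specify what differs from the common defaults.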
I can now create any file I'd like which may use variable substitution and other jinja2 style templating. Take
for example these two files:
{% raw %}
```
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
network:
version: 2
renderer: networkd
ethernets:
enp1s0:
optional: true
accept-ra: false
dhcp4: false
addresses: [ {{node.mgmt.ipv4}}, {{node.mgmt.ipv6}} ]
gateway4: {{lab.mgmt.gw4}}
gateway6: {{lab.mgmt.gw6}}
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
domain lab.ipng.ch
search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
{% for resolver in lab.nameserver.addresses %}
nameserver {{resolver}}
{%endfor%}
```
{% endraw %}
The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that
each LAB can have their own unique resolvers. I point these at the `lab.ipng.ch` uplink interface, in the case
of the LAB `hvn0.lab.ipng.ch`, this will be 192.168.10.4 and 2001:678:d78:201::ffff, but on `hvn1.lab.ipng.ch`
I can override that to become 192.168.11.4 and 2001:678:d78:211::ffff.
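As a sketch of the rendering step (assuming the third-party `jinja2` package is available; this mirrors the `resolv.conf.j2` template above but is not the actual generator code):

{% raw %}
```python
from jinja2 import Template  # third-party: pip install jinja2

# Context shaped like the merged YAML config, abbreviated.
ctx = {"lab": {"nameserver": {
    "search": ["lab.ipng.ch", "ipng.ch"],
    "addresses": ["192.168.10.4", "2001:678:d78:201::ffff"],
}}}

# The resolv.conf.j2 template from above, inlined as a string.
tmpl = Template(
    "domain lab.ipng.ch\n"
    "search{% for domain in lab.nameserver.search %} {{domain}}{% endfor %}\n"
    "{% for resolver in lab.nameserver.addresses %}"
    "nameserver {{resolver}}\n"
    "{% endfor %}"
)
print(tmpl.render(**ctx))
```
{% endraw %}

Note that Jinja resolves `lab.nameserver.search` against plain dicts too (attribute lookup falls back to item lookup), which is why the YAML-derived context works without any wrapper classes.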
There's one subdirectory for each _overlay_ type (imagine that I want a lab that runs Bird2, but I may also
want one which runs FRR, or another thing still). Within the _overlay_ directory, there's one _common_
tree, with files that apply to every machine in the LAB, and a _hostname_ tree, with files that apply
only to specific nodes (VMs) in the LAB:
```
pim@lab:~/src/lab$ tree overlays/default/
overlays/default/
├── common
│   ├── etc
│   │   ├── bird
│   │   │   ├── bfd.conf.j2
│   │   │   ├── bird.conf.j2
│   │   │   ├── ibgp.conf.j2
│   │   │   ├── ospf.conf.j2
│   │   │   └── static.conf.j2
│   │   ├── hostname.j2
│   │   ├── hosts.j2
│   │   ├── netns
│   │   │   └── dataplane
│   │   │       └── resolv.conf.j2
│   │   ├── netplan
│   │   │   └── 01-netcfg.yaml.j2
│   │   ├── resolv.conf.j2
│   │   └── vpp
│   │       ├── bootstrap.vpp.j2
│   │       └── config
│   │           ├── defaults.vpp
│   │           ├── flowprobe.vpp.j2
│   │           ├── interface.vpp.j2
│   │           ├── lcp.vpp
│   │           ├── loopback.vpp.j2
│   │           └── manual.vpp.j2
│   ├── home
│   │   └── ipng
│   └── root
└── hostname
    ├── vpp0-0
    │   └── etc
    │       └── vpp
    │           └── config
    │               └── interface.vpp
    └── (etc)
```
Now all that's left to do is generate this hierarchy, and of course I can check this in to git and track changes to the
templates and their resulting generated filesystem overrides over time:
```
pim@lab:~/src/lab$ ./generate -q --host hvn0.lab.ipng.ch
pim@lab:~/src/lab$ find build/default/hvn0.lab.ipng.ch/vpp0-0/ -type f
build/default/hvn0.lab.ipng.ch/vpp0-0/home/ipng/.ssh/authorized_keys
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hosts
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/resolv.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/static.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bfd.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bird.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ibgp.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ospf.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/loopback.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/flowprobe.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/interface.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/defaults.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/lcp.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/manual.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/bootstrap.vpp
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netplan/01-netcfg.yaml
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netns/dataplane/resolv.conf
build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hostname
build/default/hvn0.lab.ipng.ch/vpp0-0/root/.ssh/authorized_keys
```
## Open vSwitch maintenance
The OVS install on each Debian hypervisor in the lab is the same. I install the required Debian packages, create
a switch fabric, add one physical network port (the one that will serve as the _uplink_ for the LAB, VLAN 10 in the
sketch above), and all the virtio ports from KVM.
```
pim@hvn0-lab:~$ sudo vi /etc/netplan/01-netcfg.yaml
network:
vlans:
uplink:
optional: true
accept-ra: false
dhcp4: false
link: eno1
id: 200
pim@hvn0-lab:~$ sudo netplan apply
pim@hvn0-lab:~$ sudo apt install openvswitch-switch python3-openvswitch
pim@hvn0-lab:~$ sudo ovs-vsctl add-br vpplan
pim@hvn0-lab:~$ sudo ovs-vsctl add-port vpplan uplink tag=10
```
The `vpplan` switch fabric and its uplink port will persist across reboots. Then I make a small change to each
`libvirt`-defined virtual machine:
```
pim@hvn0-lab:~$ virsh edit vpp0-0
...
<interface type='bridge'>
<mac address='52:54:00:00:10:00'/>
<source bridge='vpplan'/>
<virtualport type='openvswitch' />
<target dev='vpp0-0-0'/>
<model type='virtio'/>
<mtu size='9216'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x0' multifunction='on'/>
</interface>
<interface type='bridge'>
<mac address='52:54:00:00:10:01'/>
<source bridge='vpplan'/>
<virtualport type='openvswitch' />
<target dev='vpp0-0-1'/>
<model type='virtio'/>
<mtu size='9216'/>
<address type='pci' domain='0x0000' bus='0x10' slot='0x00' function='0x1'/>
</interface>
... etc
```
The only two things I need to do are to ensure that the _source bridge_ is named the same as
the OVS fabric (in my case `vpplan`), and that the _virtualport_ type is `openvswitch`, and that's it!
Once all four `vpp0-*` virtual machines each have all four of their network cards updated, when they
boot, the hypervisor will add them each as new untagged ports in the OVS fabric.
To then build the topology that I have in mind for the LAB, where each VPP machine is daisychained to
its sibling, all we have to do is program that into the OVS configuration:
```
pim@hvn0-lab:~$ cat << 'EOF' > ovs-config.sh
#!/bin/sh
#
# OVS configuration for the `default` overlay
LAB=${LAB:=0}
for node in 0 1 2 3; do
for int in 0 1 2 3; do
ovs-vsctl set port vpp${LAB}-${node}-${int} vlan_mode=native-untagged
done
done
# Uplink is VLAN 10
ovs-vsctl add port vpp${LAB}-0-0 tag 10
ovs-vsctl add port uplink tag 10
# Link vpp${LAB}-0 <-> vpp${LAB}-1 in VLAN 20
ovs-vsctl add port vpp${LAB}-0-1 tag 20
ovs-vsctl add port vpp${LAB}-1-0 tag 20
# Link vpp${LAB}-1 <-> vpp${LAB}-2 in VLAN 21
ovs-vsctl add port vpp${LAB}-1-1 tag 21
ovs-vsctl add port vpp${LAB}-2-0 tag 21
# Link vpp${LAB}-2 <-> vpp${LAB}-3 in VLAN 22
ovs-vsctl add port vpp${LAB}-2-1 tag 22
ovs-vsctl add port vpp${LAB}-3-0 tag 22
EOF
pim@hvn0-lab:~$ chmod 755 ovs-config.sh
pim@hvn0-lab:~$ sudo ./ovs-config.sh
```
The first block here loops over all nodes and, for each of their ports, sets the VLAN mode to what
OVS calls 'native-untagged'. In this mode, the `tag` becomes the VLAN in which the port will operate,
but to additionally carry dot1q tagged VLANs, we can use the syntax `add port ... trunks 10,20,30`.
To see the configuration, `ovs-vsctl list port vpp0-0-0` will show the switch port configuration, while
`ovs-vsctl list interface vpp0-0-0` will show the virtual machine's NIC configuration (think of the
difference here as the switch port on the one hand, and the NIC (interface) plugged into it on the other).
### Deployment
There are three main points to consider when deploying these lab VMs:
1. Create the VMs and their ZFS datasets
1. Destroy the VMs and their ZFS datasets
1. Bring the VMs into a pristine state
#### Create
If the hypervisor doesn't yet have a LAB running, we need to create it:
```
BASE=${BASE:=ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release}
BUILD=${BUILD:=default}
LAB=${LAB:=0}
## Do not touch below this line
LABDIR=/var/lab
STAGING=$LABDIR/staging
HVN="hvn${LAB}.lab.ipng.ch"
echo "* Cloning base"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
mkdir -p $STAGING/\$VM; zfs clone $BASE ssd-vol0/\$VM; done"
sleep 1
echo "* Mounting in staging"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
mount /dev/zvol/ssd-vol0/\$VM-part1 $STAGING/\$VM; done"
echo "* Rsyncing build"
rsync -avugP build/$BUILD/$HVN/ root@hvn${LAB}.lab.ipng.ch:$STAGING
echo "* Setting permissions"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
chown -R root. $STAGING/\$VM/root; done"
echo "* Unmounting and snapshotting pristine state"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
umount $STAGING/\$VM; zfs snapshot ssd-vol0/\${VM}@pristine; done"
echo "* Starting VMs"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
virsh start \$VM; done"
echo "* Committing OVS config"
scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR
ssh root@$HVN "set -x; LAB=$LAB $LABDIR/ovs-config.sh"
```
After running this, the hypervisor will have 4 clones, and 4 snapshots (one for each virtual machine):
```
root@hvn0-lab:~# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
ssd-vol0 6.80G 367G 24K /ssd-vol0
ssd-vol0/hvn0.chbtl0.ipng.ch 6.60G 367G 24K none
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0 6.60G 367G 24K none
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G -
ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G -
ssd-vol0/vpp0-0 43.6M 367G 6.04G -
ssd-vol0/vpp0-0@pristine 1.13M - 6.04G -
ssd-vol0/vpp0-1 25.0M 367G 6.04G -
ssd-vol0/vpp0-1@pristine 1.14M - 6.04G -
ssd-vol0/vpp0-2 42.2M 367G 6.04G -
ssd-vol0/vpp0-2@pristine 1.13M - 6.04G -
ssd-vol0/vpp0-3 79.1M 367G 6.04G -
ssd-vol0/vpp0-3@pristine 1.13M - 6.04G -
```
The last thing the create script does is commit the OVS configuration, because when the VMs are shut down
or newly created, KVM will add them to the switching fabric as untagged/unconfigured ports.
But would you look at that! The delta between the base image and the `pristine` snapshots is about 1MB of
configuration files: the ones that I generated and rsync'd in above. Once the machine boots, it
will have a read/write mounted filesystem as per normal, except it's a delta on top of the snapshotted,
cloned dataset.
#### Destroy
I love destroying things! In this case, though, I'm merely removing what are essentially ephemeral disk images, as
I still have the base image to clone from. The destroy is conceptually very simple:
```
LAB=${LAB:=0}
## Do not touch below this line
HVN="hvn${LAB}.lab.ipng.ch"
echo "* Destroying VMs"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
virsh destroy \$VM; done"
echo "* Destroying ZFS datasets"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
zfs destroy -r ssd-vol0/\$VM; done"
```
After running this, the VMs will be shutdown and their cloned filesystems (including any snapshots
those may have) are wiped. To get back into a working state, all I must do is run `./create` again!
#### Pristine
Sometimes though, I don't need to completely destroy the VMs, but rather want to put them back into
the state they were in just after creating the LAB. Luckily, the create script made a snapshot (called `pristine`)
for each VM before booting it, so bringing the LAB back to _factory default_ settings is really easy:
```
BUILD=${BUILD:=default}
LAB=${LAB:=0}
## Do not touch below this line
LABDIR=/var/lab
STAGING=$LABDIR/staging
HVN="hvn${LAB}.lab.ipng.ch"
## Bring back into pristine state
echo "* Restarting VMs from pristine snapshot"
ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
virsh destroy \$VM;
zfs rollback ssd-vol0/\${VM}@pristine;
virsh start \$VM; done"
echo "* Committing OVS config"
scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR
ssh root@$HVN "set -x; $LABDIR/ovs-config.sh"
```
## Results
After completing this project, I have a completely hands-off, automated, autogenerated, and very manageable set
of three LABs, each booting up into a running OSPF/OSPFv3 enabled topology for IPv4 and IPv6:
```
pim@lab:~/src/lab$ traceroute -q1 vpp0-3
traceroute to vpp0-3 (192.168.10.3), 30 hops max, 60 byte packets
1 e0.vpp0-0.lab.ipng.ch (192.168.10.5) 1.752 ms
2 e0.vpp0-1.lab.ipng.ch (192.168.10.7) 4.064 ms
3 e0.vpp0-2.lab.ipng.ch (192.168.10.9) 5.178 ms
4 vpp0-3.lab.ipng.ch (192.168.10.3) 7.469 ms
pim@lab:~/src/lab$ ssh ipng@vpp0-3
ipng@vpp0-3:~$ traceroute6 -q1 vpp2-3
traceroute to vpp2-3 (2001:678:d78:220::3), 30 hops max, 80 byte packets
1 e1.vpp0-2.lab.ipng.ch (2001:678:d78:201::3:2) 2.088 ms
2 e1.vpp0-1.lab.ipng.ch (2001:678:d78:201::2:1) 6.958 ms
3 e1.vpp0-0.lab.ipng.ch (2001:678:d78:201::1:0) 8.841 ms
4 lab0.lab.ipng.ch (2001:678:d78:201::ffff) 7.381 ms
5 e0.vpp2-0.lab.ipng.ch (2001:678:d78:221::fffe) 8.304 ms
6 e0.vpp2-1.lab.ipng.ch (2001:678:d78:221::1:21) 11.633 ms
7 e0.vpp2-2.lab.ipng.ch (2001:678:d78:221::2:22) 13.704 ms
8 vpp2-3.lab.ipng.ch (2001:678:d78:220::3) 15.597 ms
```
If you read this far, thanks! Each of these three LABs come with 4x10Gbit DPDK based packet generators (Cisco T-Rex),
four VPP machines running either Bird2 or FRR, and together they are connected to a 100G capable switch.
**These LABs are for rent, and we offer hands-on training on them.** Please **[contact](/s/contact/)** us for
daily/weekly rates, and custom training sessions.
I checked the generator and deploy scripts in to a git repository, which I'm happy to share if there's
interest. But because it contains a few implementation details and doesn't do a lot of fool-proofing, and
because most of this can be easily recreated by interested parties from this blogpost, I decided not to publish
the LAB project on GitHub, but on our private git.ipng.ch server instead. Mail us if you'd like to take a closer look,
I'm happy to share the code.

---
date: "2022-11-20T22:35:14Z"
title: Mastodon - Part 1 - Installing
---
# About this series
{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}
I have seen companies achieve great successes in the consumer internet and entertainment space, but I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies sometimes take away my autonomy and exert control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and towards
privately operated ones.
This series details my findings starting a micro blogging website, which uses a new set of super interesting open interconnect protocols to
share media (text, pictures, videos, etc) between producers and their followers, using an open source project called
[Mastodon](https://joinmastodon.org/).
## Introduction
Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of
updates on your profile. You can publish text posts and optionally attach media such as pictures, audio, video, or polls. Mastodon lets you
follow friends and discover new ones. It doesn't do this in a centralized way, however.
Groups of people congregate on a given server, of which they become a user by creating an account on that server. Then, they interact with
one another on that server, but users can also interact with folks on _other_ servers. Instead of following **@IPngNetworks**, they might
follow a user on a given server domain, like **@IPngNetworks@ublog.tech**. This way, all these servers can be run _independently_ but
interact with each other using a common protocol (called ActivityPub). I've heard this concept be compared to choosing an e-mail provider: I
might choose Google's gmail.com, and you might use Microsoft's live.com. However we can send e-mails back and forth due to this common
protocol (called SMTP).
### uBlog.tech
I thought I would give it a go, mostly out of engineering curiosity but also because I feel more strongly today that we (the users) ought to
take a bit more ownership back. I've been a regular blogging and micro-blogging user for approximately forever, and I think it may be a
good investment of my time to learn a bit more about the architecture of Mastodon. So, I've decided to build and _productionize_ a server
instance.
I registered [uBlog.tech](https://ublog.tech). Incidentally, if you're reading this and would like to participate, the server welcomes users
in the network-, systems- and software engineering disciplines. But, before I can get to the fun parts though, I have to do a bunch of work
to get this server in a shape in which it can be trusted with user generated content.
### Hardware
I'm running Debian on (a set of) Dell R720s hosted by IPng Networks in Zurich, Switzerland. These machines are all roughly the same, and
come with:
* 2x10C/10T Intel E5-2680 (so 40 CPUs)
* 256GB ECC RAM
* 2x240G SSD in mdraid to boot from
* 3x1T SSD in ZFS for fast storage
* 6x16T harddisk with 2x500G SSD for L2ARC, in ZFS for bulk storage
Data integrity and durability is important to me. It's the one thing that typically the commercial vendors do really well, and my pride
prohibits me from losing data due to things like "disk failure" or "computer broken" or "datacenter on fire". So, I handle backups in two
main ways: borg(1) and zrepl(1).
* **Hypervisor hosts** make a daily copy of their entire filesystem using **borgbackup(1)** to a set of two remote fileservers. This way, the
important file metadata, configs for the virtual machines, and so on, are all safely stored remotely.
* **Virtual machines** are running on ZFS blockdevices on either the SSD pool, or the disk pool, or both. Using a tool called **zrepl(1)**
(which I described a little bit in a [[previous post]({% post_url 2022-10-14-lab-1 %})]), I create a snapshot every 12hrs on the local
blockdevice, and incrementally copy away those snapshots daily to the remote fileservers.
If I do something silly on a given virtual machine, I can roll back the machine filesystem state to the previous checkpoint and reboot. This has
saved my butt a number of times, during say a PHP 7 to 8 upgrade for Librenms, or during an OpenBSD upgrade that ran out of disk midway
through. Being able to roll back to a last known good state is awesome, and completely transparent for the virtual machine, as the
snapshotting is done on the underlying storage pool in the hypervisor. The fileservers run physically separated from the server pools, one in
Zurich and another in Geneva, so this way, if I were to lose the entire machine, I still have a ~12h old backup in two locations.
### Software
I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and two virtio network cards. One NIC will
connect to a backend LAN in some RFC1918 address space, and the other will present an IPv4 and IPv6 interface to the internet. I give this
machine two blockdevices, one small one of 16GB (vda) that is created on the hypervisor's `ssd-vol0/libvirt/ublog-disk0`, to be used only
for boot, logs and OS. Then, a second one (vdb) is created at 300GB on `ssd-vol1/libvirt/ublog-disk1` and it will be used for Mastodon and
its supporting services.
Then I simply install Debian into **vda** using `virt-install`. At IPng Networks we have some ansible-style automation that takes over the
machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a
firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 and 443 because this is to be a webserver.
After installing Debian Bullseye, I'll create the following ZFS filesystems on **vdb**:
```
pim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon -o quota=10G data/mastodon
pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/elasticsearch -o quota=10G data/elasticsearch
pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/postgresql -o quota=20G data/postgresql
pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/redis -o quota=2G data/redis
pim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon/live/public/system data/mastodon-system
```
As a sidenote, I realize that this ZFS filesystem pool consists only of **vdb**, but its underlying blockdevice is protected in a raidz, and
it is copied incrementally off-site every day by the hypervisor. I'm pretty confident on safety here, but I prefer to use ZFS for the virtual
machine guests as well, because now I can do local snapshotting of, say, `data/mastodon-system`, and I can more easily grow/shrink the
datasets for the supporting services, as well as monitor them individually for unbounded growth.
#### Installing Mastodon
I then go through the public [Mastodon docs](https://docs.joinmastodon.org/admin/install/) to further install the machine. I choose not to
go the Docker route, but instead stick to systemd installs. The install itself is pretty straightforward, but I did find the nginx config
a bit rough around the edges (notably because the default files I'm asked to use have their ssl certificate stanzas commented out while
trying to listen on port 443, which makes nginx and certbot very confused). A cup of tea later, and we're all good.
I am not going to start prematurely optimizing, and after a very engaging thread on Mastodon itself
[[@davidlars@hachyderm.io](https://ublog.tech/@davidlars@hachyderm.io/109381163342345835)] with a few fellow admins, the consensus really is
to _KISS_ (keep it simple, silly!). In that thread, I made a few general observations on scaling up and out (none of which I'll be doing
initially), just by using some previous experience as a systems engineer, and knowing a bit about the components used here:
* Running services on dedicated machines (ie. separate storage, postgres, Redis, Puma and Sidekiq workers)
* Fiddle with Puma worker pool (more workers, and/or more threads per worker)
* Fiddle with Sidekiq worker pool and dedicated instances per queue
* Put storage on local minio cluster
* Run multiple postgres databases, read-only replicas, or multimaster
* Run cluster of multiple redis instances instead of one
* Split off the cache redis into mem-only
* Frontend the service with a cluster of NGINX + object caching
Some other points of interest for those of us on the adventure of running our own machines follow:
#### Logging
Mastodon is a chatty one - it is logging to stdout/stderr and most of its tasks in Sidekiq have a lot to say. On Debian, by default this
output goes from **systemd** into **journald** which in turn copies it into **syslogd**. The result of this is that each logline hits the
disk three (!) times. And also by default, Debian and Ubuntu aren't too great at log hygiene. While `/var/log/` is scrubbed by logrotate(8),
nothing keeps the journal from growing unboundedly. So I quickly make the following change:
```
pim@ublog:~$ cat << EOF | sudo tee /etc/systemd/journald.conf
[Journal]
SystemMaxUse=500M
ForwardToSyslog=no
EOF
pim@ublog:~$ sudo systemctl restart systemd-journald
```
#### Paperclip and ImageMagick
I noticed while tailing the journal `journalctl -f` that lots of incoming media gets first spooled to /tmp and then run through a conversion
step to ensure the media is of the right format/aspect ratio. Mastodon calls a library called `paperclip` which in turn uses file(1) and
identify(1) to determine the type of file, and based on the answer for images runs convert(1) or ffmpeg(1) to munge it into the shape it
wants. I suspect that this will cause a fair bit of I/O in `/tmp`, so something to keep in mind is to either lazily turn that mountpoint
into a `tmpfs` (which is in general frowned upon), or to change the paperclip library to use a user-defined filesystem like `~mastodon/tmp`
and make _that_ a memory backed filesystem instead. The log signature in case you're curious:
```
Nov 20 21:02:10 ublog bundle[408189]: Command :: file -b --mime '/tmp/a22ab94adb939b0eb3c224bb9046c9cf20221123-408189-s0rsty.jpg'
Nov 20 21:02:10 ublog bundle[408189]: Command :: identify -format %m '/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]'
Nov 20 21:02:10 ublog bundle[408189]: Command :: convert '/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]' -auto-orient -resize "400x400>" -coalesce '/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4'
Nov 20 21:02:10 ublog bundle[408189]: Command :: convert '/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4' -depth 8 RGB:-
```
I will put a pin in this until it becomes a bottleneck, but larger server admins may have thought about this before, and if so, let me know
what you came up with!
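Should it ever become one, a quick sketch of that second option might look like the snippet below. I'm assuming here that paperclip ends up using Ruby's `Dir.tmpdir`, which honors the `TMPDIR` environment variable; the mountpoint and the 1G size cap are just guesses on my part:

```
pim@ublog:~$ cat << EOF | sudo tee -a /etc/fstab
tmpfs /home/mastodon/tmp tmpfs rw,nosuid,nodev,uid=mastodon,gid=mastodon,size=1G 0 0
EOF
pim@ublog:~$ sudo mkdir -p /home/mastodon/tmp
pim@ublog:~$ sudo mount /home/mastodon/tmp
pim@ublog:~$ cat << EOF | sudo tee -a ~mastodon/live/.env.production
TMPDIR=/home/mastodon/tmp
EOF
pim@ublog:~$ sudo systemctl restart mastodon-web mastodon-sidekiq
```

The upside over a tmpfs on `/tmp` itself is that only Mastodon's media munging is memory backed, and the size cap keeps a runaway conversion job from eating all of the machine's RAM.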
#### Elasticsearch
There's a little bit of a timebomb here, unfortunately. Following the [[Full-text
search](https://docs.joinmastodon.org/admin/optional/elasticsearch/)] docs, the install and integration is super easy. But in an upcoming
release, Elasticsearch is going to _force_ authentication by default; the current version is still tolerant of
non-secured instances, but those will break in the future. So I'm going to get ahead of that and create my instance with the minimally required
security setup in mind [[ref](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html)]:
```
pim@ublog:~$ cat << EOF | sudo tee -a /etc/elasticsearch/elasticsearch.yml
xpack.security.enabled: true
discovery.type: single-node
EOF
pim@ublog:~$ PASS=$(openssl rand -base64 12)
pim@ublog:~$ /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
(use this $PASS for the 'elastic' user)
pim@ublog:~$ cat << EOF | sudo tee -a ~mastodon/live/.env.production
ES_USER=elastic
ES_PASS=$PASS
EOF
pim@ublog:~$ sudo systemctl restart mastodon-streaming mastodon-web mastodon-sidekiq
```
Elasticsearch is a memory hog, which is not that strange considering its job is to supply full text retrieval in a large
amount of documents and data at high performance. It'll by default grab roughly half of the machine's memory, which it
really doesn't need for now. So, I'll give it a little bit of a smaller playground to expand into, by limiting its heap
to 2 GB to get us started:
```
pim@ublog:~$ cat << EOF | sudo tee /etc/elasticsearch/jvm.options.d/memory.options
-Xms2048M
-Xmx2048M
EOF
pim@ublog:~$ sudo systemctl restart elasticsearch
```
#### Mail
E-mail can be quite tricky to get right. At IPng we've been running mailservers for a while now, and we're reasonably good at delivering
mail even to the most hard-line providers (looking at you, GMX and Google). We use relays from a previous project of mine called
[[PaPHosting](https://paphosting.net)], which you can clearly see comes from the Dark Ages when the Internet was still easy. These days, our
mailservers run a combination of MTA-STS, TLS certs from Let's Encrypt, DMARC, and SPF. So our outbound mail simply uses OpenBSD's
smtpd(8), and it forwards to the remote relay pool of five servers using authentication, but only after rewriting the envelope to always
come from `@ublog.tech` and match the e-mail sender (which allows for strict SPF):
```
pim@ublog:~$ cat /etc/smtpd.conf
table aliases file:/etc/aliases
table secrets file:/etc/mail/secrets
listen on localhost
action "local_mail" mbox alias <aliases>
action "outbound" relay host "smtp+tls://papmx@smtp.paphosting.net" auth <secrets> \
mail-from "@ublog.tech"
match from local for local action "local_mail"
match from local for any action "outbound"
```
Inbound mail to the `@ublog.tech` domain is also handled by the paphosting servers, which forward them all to our respective inboxes.
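For completeness: with the envelope rewriting in place, strict SPF boils down to a single TXT record on the domain. The include target below is purely illustrative, not our actual record:

```
ublog.tech.  IN  TXT  "v=spf1 include:_spf.paphosting.net -all"
```

The `-all` at the end is the "strict" part: receivers are told to reject mail from any host not covered by the include.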
#### Server Settings
After reading a post from [[@rriemann@chaos.social](https://ublog.tech/@rriemann@chaos.social/109384055799108617)], I was quickly convinced
that having a good privacy policy is worth the time. I took their excellent advice to create a reasonable [[Privacy
Policy](https://ublog.tech/privacy-policy)]. Thanks again for that, and if you're running a server in Europe or with European users,
definitely check it out.
Rules are important. I didn't give this as much thought, but I did assert some ground rules. Even though I do believe in [[Postel's
Robustness Principle](https://en.wikipedia.org/wiki/Robustness_principle)] (_Be liberal in what you accept, and conservative in what you
send._), I generally tend to believe that computers lose their temper less often than humans, so I started off with:
1. **Behavioral Tenets**: Use welcoming and inclusive language, be respectful of differing viewpoints and experiences, gracefully accept
constructive criticism, focus on what is best for the community, show empathy towards other community members. Be kind to each other, and
yourself.
1. **Unacceptable behavior**: Use of sexualized language or imagery, unsolicited romantic attention, trolling, derogatory
comments, personal or political attacks, and doxxing are strictly prohibited, as is any other conduct considered inappropriate for a professional setting.
{{< image width="70px" float="left" src="/assets/mastodon/msie.png" alt="Favicon" >}}
I also read an entertaining (likely insider-joke) post from [[@nova@hachyderm.io](https://ublog.tech/@nova@hachyderm.io/109389072740558566)],
in which she was asking about the internet explorer favicon on her instance, so I couldn't resist but replace the mastodon favicon with the
IPng Networks one. Vanity matters.
## What's next
Now that the server is up, and I have a small amount of users (mostly folks I know from the tech industry), I took some time to explore
both the Fediverse, reach out to friends old and new, participate in a few random discussions, fiddle with the iOS apps (and in the end,
settled on Toot! with a runner up of Metatext), and generally had an *amazing* time on Mastodon these last few days.
Now, I think I'm ready to further productionize the experience. My next article will cover monitoring - a vital aspect of any serious
project. I'll go over Prometheus, Grafana, Alertmanager and how to get the most signal out of a running Mastodon instance. Stay tuned!
If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit more load /
traffic on this instance will allow me to learn (and in turn, to share with others)!
---
date: "2022-11-24T01:20:14Z"
title: Mastodon - Part 2 - Monitoring
---
# About this series
{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}
I have seen companies achieve great successes in the consumer internet and entertainment space. I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.
In the [[previous post]({% post_url 2022-11-20-mastodon-1 %})], I shared some thoughts on how the overall install of a Mastodon instance
went, making it a point to ensure my users' (and my own!) data is reasonably safe, and that the machine runs on good hardware with good
connectivity. Thanks IPng, for that 10G connection! In this post, I visit an old friend,
[[Borgmon](https://sre.google/sre-book/practical-alerting/)], which has since reincarnated and become the _de facto_ open source
observability and signals ecosystem, and its incomparably awesome friend. Hello, **Prometheus** and **Grafana**!
## Anatomy of Mastodon
Looking more closely at the architecture of Mastodon, it consists of a few moving parts:
1. **Storage**: At the bottom, there's persistent storage, in my case **ZFS**, on which account information (like avatars), media
attachments, and site-specific media lives. As posts stream to my instance, their media is spooled locally for performance.
1. **State**: Application state is kept in two databases:
* Firstly, a **SQL database** which is chosen to be [[PostgreSQL](https://postgresql.org)].
* Secondly, a memory based **key-value storage** system [[Redis](https://redis.io)] is used to track the vitals of home feeds,
list feeds, Sidekiq queues as well as Mastodon's streaming API.
1. **Web (transactional)**: The webserver that serves end user requests and the API is a Ruby webserver called
[[Puma](https://github.com/puma/puma)]. Puma tries to do its job efficiently, and doesn't allow itself to be bogged down by long lived web
sessions, such as the ones where clients get streaming updates to their timelines on the web- or mobile client.
1. **Web (streaming)**: This webserver is written in [[NodeJS](https://nodejs.org/en/about/)] and excels at long lived connections
that use Websockets, by providing a Streaming API to clients.
1. **Web (frontend)**: To tie all the current and future microservices together, provide SSL (for HTTPS), and a local object cache for
things that don't change often, one or more [[NGINX](https://nginx.org)] servers are used.
1. **Backend (processing)**: Many interactions with the server (such as distributing posts) turn in to background tasks that are enqueued
and handled asynchronously by a worker pool provided by [[Sidekiq](https://github.com/mperham/sidekiq)].
1. **Backend (search)**: Users that wish to search the local corpus of posts and media, can interact with an instance of
[[Elastic](https://www.elastic.co/)], a free and open search and analytics solution.
These systems all interact in particular ways, but I immediately noticed one interesting tidbit. Pretty much every system in this list can
(or can be easily made to) emit metrics in a popular [[Prometheus](https://prometheus.io/)] format. I cannot overstate the love I have for
this project, both technically but also socially because I know how it came to be. Ben, thanks for the RC racecars (I still have them!).
Matt, I admire your Go- and Java-skills and your general workplace awesomeness. And Richi sorry to have missed you last week in Hamburg at
[[DENOG14](https://www.denog.de/de/meetings/denog14/agenda.html)]!
## Prometheus
Taking stock of the architecture here, I think my best bet is to rig this stuff up with Prometheus. This works mostly by having a central
server (in my case external to [[uBlog.tech](https://ublog.tech)]) periodically scrape a bunch of timeseries metrics, after which I can create
pretty graphs of them, but also monitor if some values seem out of whack, like a Sidekiq queue delay rising, or CPU or disk I/O running a bit
hot. And the best thing yet? I get pretty much all of this for free, because other, smarter folks have already contributed to this
ecosystem:
* **Server**: monitoring is canonically done by [[Node Exporter](https://prometheus.io/docs/guides/node-exporter/)]. It provides metrics for
all the lowlevel machine and kernel stats you'd ever think to want: network, disk, cpu, processes, load, and so on.
* **Redis**: Is provided by [[Redis Exporter](https://github.com/oliver006/redis_exporter)] and can show all sorts of operations on data
realms served by Redis.
* **PostgreSQL**: is provided by [[Postgres Exporter](https://github.com/prometheus-community/postgres_exporter)] which is
maintained by the Prometheus Community.
* **NGINX**: Is provided by [[NGINX Exporter](https://github.com/nginxinc/nginx-prometheus-exporter)] which is maintained by the company
behind NGINX. I used to have a Lua based exporter (when I ran [[SixXS](https://sixxs.net/)]) which had lots of interesting additional
stats, but for the time being I'll just use this one.
* **Elastic**: has a converter from its own metrics system in the [[Elasticsearch
Exporter](https://github.com/prometheus-community/elasticsearch_exporter)], once again maintained by the (impressively fabulous!)
Prometheus Community.
All of these implement a common pattern: they take the (bespoke, internal) representation of statistics counters or dials/gauges, and
transform them into a common format called the _Metrics Exposition_ format, and they provide this in either an HTTP endpoint (typically
using a `/metrics` URI handler directly on the webserver), or in a push-mechanism using a popular
[[Pushgateway](https://prometheus.io/docs/instrumenting/pushing/)] in case there is no server to poll, for example a batch process that did
some work and wanted to report on its results.
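As a small illustration of that common format: it is plain text, one sample per line, with optional `# HELP` and `# TYPE` comments. The metric names below are real node- and redis-exporter names, but the values are made up:

```
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
# HELP redis_commands_processed_total Total number of commands processed by the server.
# TYPE redis_commands_processed_total counter
redis_commands_processed_total 1027612
```

The beauty of this dead-simple format is that anything that can serve a text file over HTTP can be scraped by Prometheus.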
Incidentally, a fair amount of popular open source infrastructure already has a Prometheus exporter -- check out [[this
list](https://prometheus.io/docs/instrumenting/exporters/)], but also the assigned [[TCP
ports](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)] for popular things that you might also be using.
Maybe you'll get lucky and find out that somebody has already provided an exporter, so you don't have to!
### Configuring Exporters
Now that I have found a whole swarm of these Prometheus Exporter microservices, and understand how to plumb each of them through to
what-ever it is they are monitoring, I can get cracking on some observability. Let me provide some notes for posterity, both for myself
if I ever revisit the topic and ... kind of forgot what I had done so far :), but maybe also for the adventurous, who are interested in
using Prometheus on their own Mastodon instance.
First of all, it's worth mentioning that while these exporters (typically written in Go) have command line flags, they can often also take
their configuration from environment variables, provided mostly because they operate in Docker or Kubernetes. My exporters will all run
_vanilla_ under **systemd**, but systemd units can also be configured to use environments, which is neat!
First, I create a few environment files for each **systemd** unit that contains a Prometheus exporter:
```
pim@ublog:~$ ls -la /etc/default/*exporter
-rw-r----- 1 root root 49 Nov 23 18:15 /etc/default/elasticsearch-exporter
-rw-r----- 1 root root 76 Nov 22 17:13 /etc/default/nginx-exporter
-rw-r----- 1 root root 170 Nov 22 22:41 /etc/default/postgres-exporter
-rw-r----- 1 root root 9789 May 27 2021 /etc/default/prometheus-node-exporter
-rw-r----- 1 root root 0 Nov 22 22:56 /etc/default/redis-exporter
-rw-r----- 1 root root 67 Nov 22 23:56 /etc/default/statsd-exporter
```
The contents of these files will give away passwords, like the one for ElasticSearch or Postgres, so I specifically make them readable only
by `root:root`. I won't share my passwords with you, dear reader, so you'll have to guess the contents here!
Priming the environment with these values, I will take the **systemd** unit for elasticsearch as an example:
```
pim@ublog:~$ cat << EOF | sudo tee /lib/systemd/system/elasticsearch-exporter.service
[Unit]
Description=Elasticsearch Prometheus Exporter
After=network.target
[Service]
EnvironmentFile=/etc/default/elasticsearch-exporter
ExecStart=/usr/local/bin/elasticsearch_exporter
User=elasticsearch
Group=elasticsearch
Restart=always
[Install]
WantedBy=multi-user.target
EOF
pim@ublog:~$ cat << EOF | sudo tee /etc/default/elasticsearch-exporter
ES_USERNAME=elastic
ES_PASSWORD=$(SOMETHING_SECRET) # same as ES_PASS in .env.production
EOF
pim@ublog:~$ sudo systemctl enable elasticsearch-exporter
pim@ublog:~$ sudo systemctl start elasticsearch-exporter
```
Et voil&agrave;, just like that the service starts, connects to elasticsearch, transforms all of its innards into beautiful Prometheus metrics, and
exposes them on its "registered" port, in this case 9114, which can be scraped by the Prometheus instance a few computers away, connected to
the uBlog VM via backend LAN over RFC1918. I just _knew_ that second NIC would come in useful!
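For reference, the corresponding scrape job on the Prometheus side could look roughly like this. The hostname is a stand-in for the backend RFC1918 address, and the ports are the registered defaults for each exporter:

```
scrape_configs:
  - job_name: 'ublog'
    scrape_interval: 10s
    static_configs:
      - targets:
        - 'ublog.backend.lan:9100'  # node-exporter
        - 'ublog.backend.lan:9114'  # elasticsearch-exporter
        - 'ublog.backend.lan:9121'  # redis-exporter
        - 'ublog.backend.lan:9187'  # postgres-exporter
        - 'ublog.backend.lan:9113'  # nginx-exporter
```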
{{< image width="400px" float="right" src="/assets/mastodon/prom-metrics.png" alt="ElasticSearch Metrics" >}}
All **five** of the exporters are configured and exposed. They are now providing a wealth of realtime information
on how the various Mastodon components are doing. And if any of them starts malfunctioning, running out of steam, or simply taking the day
off, I will be able to see this either by certain metrics going out of expected ranges, or by the exporter reporting that it cannot even
find the service at all (which we can also detect and turn into alarms, more on that later).
Pictured here (you should probably open it in full resolution unless you have hawk eyes) is an example of those metrics. Prometheus is
happy to handle several million of them at a relatively high scraping frequency; in my case, every 10 seconds it comes around and pulls
the data from these five exporters. While these metrics are human readable, they aren't very practical...
## Grafana
... so let's visualize them with an equally awesome tool: [[Grafana](https://grafana.com/)]. This tool provides operational dashboards for any
data that is stored here, there, or anywhere :) Grafana can render stuff from a plethora of backends, one popular and established one is
Prometheus. And as it turns out, as with Prometheus, lots of work has been done already with canonical, almost out-of-the-box, dashboards
that were contributed by folks in the field. In fact, every single one of the five exporters I installed also has an accompanying
dashboard, sometimes even multiple to choose from! Grafana allows you to [[search and download](https://grafana.com/grafana/dashboards/)]
these from a corpus they provide, referring to them by their `id`, or alternatively downloading a JSON representation of the dashboard, for
example one that comes with the exporter, or one you find on GitHub.
For uBlog, I installed: [[Node Exporter](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)], [[Postgres
Exporter](https://grafana.com/grafana/dashboards/9628-postgresql-database/)], [[Redis
Exporter](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)], [[NGINX
Exporter](https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/README.md)], and [[ElasticSearch
Exporter](https://grafana.com/grafana/dashboards/14191-elasticsearch-overview/)].
{{< image width="400px" float="right" src="/assets/mastodon/grafana-psql.png" alt="Grafana Postgres" >}}
To the right (top) you'll see a dashboard for PostgreSQL - it has lots of expert insights on how databases are used, how many read/write
operations (like SELECT and UPDATE/DELETE queries) are performed, and their respective latency expectations. What I find particularly useful
is the total amount of memory, CPU and disk activity. This allows me to see at a glance when it's time to break out
[[pgTune](https://github.com/gregs1104/pgtune)] to help change system settings for Postgres, or even inform me when it's time to move
the database to its own server rather than co-habitating with the other stuff running on this virtual machine. In my experience, stateful
systems are often the source of bottlenecks, so I take special care to monitor them and observe their performance over time. In particular,
slowness will be seen in Mastodon if the database is slow (sound familiar?).
{{< image width="400px" float="right" src="/assets/mastodon/grafana-redis.png" alt="Grafana Redis" >}}
Next, to the right (middle) you'll see a dashboard for Redis. This one shows me how full the Redis cache is (the yellow line in
the first graph is when I restarted Redis to give it a `maxmemory` setting of 1GB), but also a high resolution overview of how many
operations it's doing. I can see that the traffic is spiky, and upon closer inspection this is the `pfcount` command with a period of exactly
300 seconds, in other words something is spiking every 5min. I have a feeling that this might become an issue... and when it does, I'll get
to learn all about this elusive [[pfcount](https://redis.io/commands/pfcount/)] command. But until then, I can see the average time by
command: because Redis serves from RAM and this is a pretty quick server, I see the turnaround time for most queries to it in the
200-500 &micro;s range, wow!
{{< image width="400px" float="right" src="/assets/mastodon/grafana-node.png" alt="Grafana Node" >}}
But while these dashboards are awesome, what I find has saved me (and my ISP, IPng Networks) a metric tonne of time, is the most fundamental
monitoring in the Node Exporter dashboard, pictured to the right (bottom). What I really love about this dashboard, is that it shows at a
glance the parts of the _computer_ that are going to become a problem. If RAM is full (but not because of filesystem cache), or CPU is
running hot, or the network is flatlining at a certain throughput or packets/sec limit, these are all things that the applications running
_on_ the machine won't necessarily be able to show me more information on, but the _Node Exporter_ to the rescue: it has so many interesting
pieces of kernel and host operating system telemetry, that it is one of the single most useful tools I know. Every physical host and every
virtual machine, is exporting metrics into IPng Networks' prometheus instance, and it constantly shows me what to improve. Thanks, Obama!
## What's next
Careful readers will have noticed that this whole article talks about all sorts of interesting telemetry, observability metrics, and
dashboards, but they are all _common components_, and none of them touch on the internals of Mastodon's processes, like _Puma_ or _Sidekiq_
or the _API Services_ that Mastodon exposes. Consider this a cliff hanger (eh, mostly because I'm a bit busy at work and will need a little
more time).
In an upcoming post, I take a deep dive into this application-specific behavior and how to extract this telemetry (spoiler alert: it can be
done! and I will open source it!), as I've started to learn more about how Ruby gathers and exposes its own internals. Interestingly, one of
the things that I'll talk about is _NSA_: not the American agency, but a comical wordplay from some open source minded folks who have
blazed the path in making Ruby on Rails application performance metrics available to external observers. In a round-about way, I hope to
show how to plug these into Prometheus in the same way all the other exporters have already done so.
By the way: If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit
more load / traffic on this instance will allow me to learn (and in turn, to share with others)!
---
date: "2022-11-27T00:01:14Z"
title: Mastodon - Part 3 - statsd and Prometheus
---
# About this series
{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}
I have seen companies achieve great successes in the consumer internet and entertainment space. I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.
In my [[first post]({% post_url 2022-11-20-mastodon-1 %})], I shared some thoughts on how I installed a Mastodon instance for myself. In a
[[followup post]({% post_url 2022-11-24-mastodon-2 %})] I talked about its overall architecture and how one might use Prometheus to monitor
vital backends like Redis, Postgres and Elastic. But Mastodon _itself_ is also an application which can provide a wealth of telemetry using
a protocol called [[StatsD](https://github.com/statsd/statsd)].
In this post, I'll show how I tie these all together in a custom **Grafana Mastodon dashboard**!
## Mastodon Statistics
I noticed in the [[Mastodon docs](https://docs.joinmastodon.org/admin/config/#statsd)], that there's a one-liner breadcrumb that might be
easy to overlook, as it doesn't give many details:
> `STATSD_ADDR`: If set, Mastodon will log some events and metrics into a StatsD instance identified by its hostname and port.
Interesting, but what is this **statsd**, precisely? It's a simple text-only protocol that allows applications to send key-value pairs in
the form of `<metricname>:<value>|<type>` strings, that carry statistics of certain _type_ across the network, using either TCP or UDP.
Cool! To make use of these stats, I first add this `STATSD_ADDR` environment variable from the docs to my `.env.production` file, pointing
it at `localhost:9125`. This should make Mastodon apps emit some statistics of sorts.
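Concretely, that is a one-line addition to the environment file, followed by a restart of the Mastodon services:

```
pim@ublog:~$ cat << EOF | sudo tee -a ~mastodon/live/.env.production
STATSD_ADDR=localhost:9125
EOF
pim@ublog:~$ sudo systemctl restart mastodon-web mastodon-sidekiq mastodon-streaming
```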
I decide to take a look at those packets, instructing tcpdump to show the contents of the packets (using the `-A` flag). Considering my
destination is `localhost`, I also know which interface to tcpdump on (using the `-i lo` flag). My first attempt is a little bit noisy,
because the packet dump contains the [[IPv4 header](https://en.wikipedia.org/wiki/IPv4)] (20 bytes) and [[UDP
header](https://en.wikipedia.org/wiki/User_Datagram_Protocol)] (8 bytes) as well, but sure enough, if I start reading from the 28th byte
onwards, I get human-readable data, in a bunch of strings that start with `Mastodon`:
```
pim@ublog:~$ sudo tcpdump -Ani lo port 9125 | grep 'Mastodon.production.' | sed -e 's,.*Mas,Mas,'
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.processing_time:16272|ms
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.success:1|c
Mastodon.production.sidekiq.scheduled_size:25|g
Mastodon.production.db.tables.accounts.queries.select.duration:1.8323479999999999|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.total_duration:33.856679|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.db_time:2.3943890118971467|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.view_time:1|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c
...
```
**statsd** organizes its variable names in a dot-delimited tree hierarchy. I can clearly see some patterns in here, but why guess
when you're working with Open Source? Mastodon turns out to be using a popular Ruby library called the [[National Statsd
Agency](https://github.com/localshred/nsa)], a wordplay that I don't necessarily find all that funny. Naming aside though, this library
collects application level statistics in four main categories:
1. **:action_controller**: listens to the ActionController class that is extended into ApplicationControllers in Mastodon
1. **:active_record**: listens to any database (SQL) queries and emits timing information for them
1. **:active_support_cache**: records information regarding caching (Redis) queries, and emits timing information for them
1. **:sidekiq**: listens to Sidekiq middleware and emits information about queues, workers and their jobs
Using the library's [[docs](https://github.com/localshred/nsa)], I can clearly see the patterns described, for example in the SQL
recorder, the format will be `{ns}.{prefix}.tables.{table_name}.queries.{operation}.duration` where operation here means one of the
classic SQL query types, **SELECT**, **INSERT**, **UPDATE**, and **DELETE**. Similarly, in the cache recorder, the format will be
`{ns}.{prefix}.{operation}.duration` where operation denotes one of **read_hit**, **read_miss**, **generate**, **delete**, and so on.
Reading a bit more of the Mastodon and **statsd** library code, I learn that for all variables emitted, the namespace `{ns}` is always a
combination of the application name and Ruby Rails environment, ie. **Mastodon.production**, and the `{prefix}` is the collector name,
one of **web**, **db**, **cache** or **sidekiq**. If you're curious, the Mastodon code that initializes the **statsd** collectors lives
in `config/initializers/statsd.rb`. Alright, I conclude that this is all I need to know about the naming schema.
Moving along, **statsd** gives each variable name a [[metric type](https://github.com/statsd/statsd/blob/master/docs/metric_types.md)], which
can be counters **c**, timers **ms** and gauges **g**. In the packet dump above you can see examples of each of these. The counter type in
particular is a little bit different -- applications emit increments here - in the case of the ActivityPub.InboxesController, it
merely signaled to increment the counter by 1, not the absolute value of the counter. This is actually pretty smart, because now any number
of workers/servers can all contribute to a global counter, by each just sending incrementals which are aggregated by the receiver.
As a small critique, I happened to notice that in the sidekiq datastream, some of what I think are _counters_ are actually modeled as
_gauges_ (notably the **processed** and **failed** jobs from the workers). I will have to remember that, but after observing for a few
minutes, I think I can see lots of nifty data in here.
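To make the wire format concrete, here's a tiny Python sketch (not Mastodon code) that parses the `<metricname>:<value>|<type>` lines from the packet dump above, and shows how counter increments aggregate into a running total on the receiving side:

```python
from collections import defaultdict

def parse_statsd(line):
    """Split one statsd line into (metricname, value, type)."""
    name, rest = line.split(":", 1)
    value, mtype = rest.split("|", 1)
    return name, float(value), mtype

# Counters arrive as increments; the receiver sums them into totals.
counters = defaultdict(float)
lines = [
    "Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c",
    "Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c",
    "Mastodon.production.sidekiq.scheduled_size:25|g",
]
for line in lines:
    name, value, mtype = parse_statsd(line)
    if mtype == "c":
        counters[name] += value

print(counters)
```

Note that the real protocol also allows a sampling suffix (`|@rate`), which this sketch ignores for brevity.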
## Prometheus
At IPng Networks, we use Prometheus as a monitoring observability tool. It's worth pointing out that **statsd** has a few options itself to
visualise data, but considering I already have lots of telemetry in Prometheus and Grafana (see my [[previous post]({% post_url
2022-11-24-mastodon-2 %})]), I'm going to take a bit of a detour, and convert these metrics into the Prometheus _exposition format_, so that
they can be scraped on a `/metrics` endpoint just like the others. This way, I have all monitoring in one place and using one tool.
Monitoring is hard enough as it is, and having to learn multiple tools is _no bueno_ :)
### Statsd Exporter: overview
The community maintains a Prometheus [[Statsd Exporter](https://github.com/prometheus/statsd_exporter)] on GitHub. This tool, like many
others in the exporter family, will connect to a local source of telemetry, and convert these into the required format for consumption by
Prometheus. If left completely unconfigured, it will simply receive the **statsd** UDP packets on the Mastodon side, and export them
verbatim on the Prometheus side. This will have a few downsides, notably when new operations or controllers come into existence, I would
have to explicitly make Prometheus aware of them.
I think we can do better, specifically because of the patterns noted above, I can condense the many metricnames from **statsd** into a few
carefully chosen Prometheus metrics, and add their variability into _labels_ in those time series. Taking SQL queries as an example, I see
that there's a metricname for each known SQL table in Mastodon (and there are many) and then for each table, a unique metric is created for
each of the four operations:
```
Mastodon.production.db.tables.{table_name}.queries.select.duration
Mastodon.production.db.tables.{table_name}.queries.insert.duration
Mastodon.production.db.tables.{table_name}.queries.update.duration
Mastodon.production.db.tables.{table_name}.queries.delete.duration
```
What if I could rewrite these by capturing the `{table_name}` into a label, and, observing that there are four query types (SELECT, INSERT,
UPDATE, DELETE), capture those into an `{operation}` label as well, like so:
```
mastodon_db_operation_sum{operation="select",table="users"} 85.910
mastodon_db_operation_sum{operation="insert",table="accounts"} 112.70
mastodon_db_operation_sum{operation="update",table="web_push_subscriptions"} 6.55
mastodon_db_operation_sum{operation="delete",table="web_settings"} 9.668
mastodon_db_operation_count{operation="select",table="users"} 28790
mastodon_db_operation_count{operation="insert",table="accounts"} 610
mastodon_db_operation_count{operation="update",table="web_push_subscriptions"} 380
mastodon_db_operation_count{operation="delete",table="web_settings"} 4
```
This way, there are only two Prometheus metric names **mastodon_db_operation_sum** and **mastodon_db_operation_count**. The first one counts
the cumulative time spent performing operations of that type on the table, and the second one counts the total amount of queries of that
type on the table. If I take the **rate()** of the count variable, I will have queries-per-second, and if I divide the **rate()** of the
time spent by the **rate()** of the count, I will have a running average time spent per query over that time interval.
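To make that arithmetic concrete, here is a small Python sketch of what **rate()** does with these two counters. The sample values and the 60-second scrape interval are made up for illustration:

```python
# Two hypothetical scrapes of the same counter pair, 60 seconds apart.
t0 = {"mastodon_db_operation_count": 28790, "mastodon_db_operation_sum": 85.910}
t1 = {"mastodon_db_operation_count": 30590, "mastodon_db_operation_sum": 91.310}
interval = 60.0  # seconds between the two scrapes

# rate() over the count gives queries-per-second ...
qps = (t1["mastodon_db_operation_count"] - t0["mastodon_db_operation_count"]) / interval

# ... and rate(sum) / rate(count) gives the running average time spent per query.
avg_latency = (t1["mastodon_db_operation_sum"] - t0["mastodon_db_operation_sum"]) \
            / (t1["mastodon_db_operation_count"] - t0["mastodon_db_operation_count"])

print(qps)          # 30.0 queries/second
print(avg_latency)  # ~0.003 seconds/query
```

Prometheus does the same thing, just continuously and per unique label combination.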
### Statsd Exporter: configuration
The Prometheus folks also thought of this, _quelle surprise_, and the exporter provides incredibly powerful transformation functionality
between the hierarchical tree-form of **statsd** and the multi-dimensional labeling format of Prometheus. This is called the [[Mapping
Configuration](https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration)], and it allows either globbing or regular
expression matching of the input metricnames, turning them into labeled output metrics. Building further on our example for SQL queries, I
can create a mapping like so:
```
pim@ublog:~$ cat << EOF | sudo tee /etc/prometheus/statsd-mapping.yaml
mappings:
- match: Mastodon\.production\.db\.tables\.(.+)\.queries\.(.+)\.duration
match_type: regex
name: "mastodon_db_operation"
labels:
table: "$1"
    operation: "$2"
EOF
```
This snippet will use a regular expression to match input metricnames, carefully escaping the dot-delimiters. Within the input, I will match
two groups, the segment following `tables.` holds the variable SQL Table name and the segment following `queries.` captures the SQL Operation.
Once this matches, the exporter will give the resulting variable in Prometheus simply the name `mastodon_db_operation` and add two labels
with the results of the regexp capture groups.
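As a sanity check, the same regular expression can be exercised in Python against one of the hierarchical metricnames. The specific table and operation here are just example inputs:

```python
import re

# The same regular expression as in the mapping file above, with the
# dot-delimiters escaped, applied to one hierarchical statsd metricname.
pattern = re.compile(r"Mastodon\.production\.db\.tables\.(.+)\.queries\.(.+)\.duration")

m = pattern.fullmatch("Mastodon.production.db.tables.users.queries.select.duration")
table, operation = m.groups()
print(table, operation)  # users select
```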
This one mapping I showed above will take care of _all of the metrics_ from the **database** collector, but there are three other collectors in
Mastodon's Ruby world. In the interest of brevity, I won't bore you with them in this article, as this is mostly a rinse-and-repeat jobbie.
But I have attached a copy of the complete mapping configuration at the end of this article. With all of that hard work on mapping
completed, I can now start the **statsd** exporter and see its beautifully formed and labeled timeseries show up on port 9102, the default
[[assigned port](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exporters.md)] for this exporter type.
## Grafana
First let me start by saying I'm incredibly grateful to all the folks who have contributed to existing exporters and Grafana dashboards,
notably for [[Node Exporter](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)],
[[Postgres Exporter](https://grafana.com/grafana/dashboards/9628-postgresql-database/)],
[[Redis Exporter](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)],
[[NGINX Exporter](https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/README.md)], and [[ElasticSearch
Exporter](https://grafana.com/grafana/dashboards/14191-elasticsearch-overview/)]. I'm ready to make a modest contribution back to this
wonderful community of monitoring dashboards, in the form of a Grafana dashboard for Mastodon!
Writing these is pretty rewarding. I'll take some time to explain a few Grafana concepts, although this is not meant to be a tutorial at
all and honestly, I'm not that good at this anyway. A good dashboard design goes from a 30'000ft overview of the most vital stats (not
necessarily graphs, but using visual clues like colors), and gives more information in so-called _drill-down_ dashboards that allow a much
finer grained / higher resolution picture of a specific part of the monitored application.
The collectors emit telemetry for four main parts of the application (remember, the `{prefix}` is one of **web**, **db**, **cache**, or
**sidekiq**), so I will give the dashboard the same structure. Also, I will try my best not to invent new terminology: the application
developers have given their telemetry certain names, and I will stick to those. Building a dashboard this way, application developers as
well as application operators will more likely be talking about the same things.
#### Mastodon Overview
{{< image src="/assets/mastodon/mastodon-stats-overview.png" alt="Mastodon Stats Overview" >}}
In the **Mastodon Overview**, each of the four collectors gets one or two stats-chips to present their highest level vital signs on. For a
web application, this will largely be requests per second, latency and possibly errors served. For a SQL database, this is typically the
issued queries and their latency. For a cache, the types of operation and again the latency observed in those operations. For Sidekiq (the
background worker pool that performs certain tasks in a queue on behalf of the system or user), I decide to focus on units of work, latency
and queue sizes.
Setting up the Prometheus queries in Grafana that fetch the data I need for these is typically going to be one of two things:
1. **QPS**: Taking the **rate()** of the monotonically increasing *_count* over, say, one minute yields the average queries-per-second. Considering
   the counters I created have labels that tell me what they are counting (for example in Puma, which API endpoint is being queried, and
   what format that request is using), I can now elegantly aggregate those application-wide, like so:
> sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))
2. **Latency**: The metrics in Prometheus also aggregate a monotonically increasing *_sum*, which tells me about the total time spent doing
   those things. It's pretty easy to calculate the running average latency over the last minute, by simply dividing the rate of time spent
   by the rate of requests served, like so:
> sum by (mastodon)(rate(mastodon_controller_duration_sum[1m])) <br /> / <br />
> sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))
To avoid clutter, I will leave the detailed full resolution view (like which _controller_ exactly, and what _format_ was queried, and which
_action_ was taken in the API) to a drilldown below. These two patterns are continued throughout the overview panel. Each QPS value is
rendered in dark blue, while the latency gets a special treatment on colors: I define a threshold which I consider "unacceptable", and then
create a few thresholds in Grafana to change the color as I approach that unacceptable max limit. By means of example, the Puma Latency
element I described above will have a maximum acceptable latency of 250ms. If the latency is above 40% of that, the color will turn yellow;
above 60% it'll turn orange, and above 80% it'll turn red. This provides a visual cue that something may be wrong.
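The coloring logic can be sketched as a tiny function. Only the 250ms maximum and the 40/60/80% thresholds come from the dashboard described above; the function name and the "green" default are my own:

```python
def latency_color(latency_ms: float, max_ms: float = 250.0) -> str:
    """Map a latency onto Grafana-style threshold colors: yellow above 40%
    of the acceptable maximum, orange above 60%, red above 80%."""
    fraction = latency_ms / max_ms
    if fraction > 0.8:
        return "red"
    if fraction > 0.6:
        return "orange"
    if fraction > 0.4:
        return "yellow"
    return "green"

print(latency_color(90))   # green  (36% of 250ms)
print(latency_color(130))  # yellow (52%)
print(latency_color(210))  # red    (84%)
```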
#### Puma Controllers
The APIs that Mastodon offers are served by a component called [[Puma](https://github.com/puma/puma)], a simple, fast, multi-threaded, and
highly parallel HTTP 1.1 server for Ruby/Rack applications. The application running in Puma typically defines endpoints as so-called
ActionControllers, which Mastodon expands on in a derived concept called ApplicationControllers which each have a unique _controller_ name
(for example **ActivityPub.InboxesController**), an _action_ performed on them (for example **create**, **show** or **destroy**), and a
_format_ in which the data is handled (for example **html** or **json**). For each cardinal combination, a set of timeseries (counter, time
spent and latency quantiles) will exist. At the moment, there are about 53 API controllers, 8 actions, and 4 formats, which means there are
1'696 interesting metrics to inspect.
Drawing all of these in one graph quickly turns into an unmanageable mess, but there's a neat trick in Grafana: what if I could make these
variables selectable, and maybe pin them to exactly one value (for example, all information with a specific _controller_), that would
greatly reduce the amount of data we have to show. To implement this, the dashboard can pre-populate a variable based on a Prometheus query.
By means of example, to find the possible values of _controller_, I might take a look at all Prometheus metrics with name
**mastodon_controller_duration_count** and search for labels within them with a regular expression, for example **/controller="([^"]+)"/**.
What this will do is select all values in the group `"([^"]+)"` which may seem a little bit cryptic at first. The logic behind it is first
create a group between parenthesis `(...)` and then within that group match a set of characters `[...]` where the set is all characters
except the double-quote `[^"]` and that is repeated one-or-more times with the `+` suffix. So this will precisely select the string between
the double-quotes in the label: `controller="whatever"` will return `whatever` with this expression.
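The same capture group can be tried out in Python. The series below is a hypothetical scrape line for illustration, not actual exporter output:

```python
import re

# The Grafana variable regex from above, used to recover the label value
# between the double-quotes of a scraped metric line.
series = 'mastodon_controller_duration_count{controller="ActivityPub.InboxesController",action="create"} 42'
m = re.search(r'controller="([^"]+)"', series)
print(m.group(1))  # ActivityPub.InboxesController
```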
{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-details.png" alt="Mastodon Puma Controller Details" >}}
After creating three of these, one for _controller_, _action_ and _format_, three new dropdown selectors appear at the top of my dashboard.
I will allow any combination of selections, including "All" of them (the default). Then, if I wish to drill down, I can pin one or more of
these variables to narrow down the total amount of timeseries to draw.
Shown to the right are two examples, one with "All" timeseries in the graph, which shows at least which one(s) are outliers. In this case,
the orange trace in the top graph showing more than average operations is the so-called **ActivityPub.InboxesController**.
I can find this out by hovering over the orange trace, the tooltip will show me the current name and value. Then, selecting this in the top
navigation dropdown **Puma Controller**, Grafana will narrow down the data for me to only those relevant to this controller, which is super
cool.
#### Drilldown in Grafana
{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-controller.png" alt="Mastodon Puma Controller Details" >}}
Where the graph down below (called _Action Format Controller Operations_) showed all 1'600 or so timeseries, selecting the one controller I'm
interested in, shows me a much cleaner graph with only three timeseries, take a look to the right. Just by playing around with this data, I'm
learning a lot about the architecture of this application!
For example, I know that the only _action_ on this particular controller seems to be **create**, and there are three available _formats_ in which
this create action can be performed: **all**, **html** and **json**. And using the graph above that got me started on this little journey, I now
know that the traffic spike was for `controller=ActivityPub.InboxesController, action=create, format=all`. Dope!
#### SQL Details
{{< image src="/assets/mastodon/mastodon-sql-details.png" alt="Mastodon SQL Details" >}}
While I already have a really great [[Postgres Dashboard](https://grafana.com/grafana/dashboards/9628-postgresql-database/)] (the one that
came with Postgres _server_), it is also good to be able to see what the _client_ is experiencing. Here, we can drill down on two variables,
called `$sql_table` and `$sql_operation`. For each {table,operation}-tuple, the average, median and 90th/99th percentile latency are
available. So I end up with the following graphs and dials for tail latency: the top left graph shows me something interesting -- most
queries are SELECT, but the bottom graph shows me lots of tables (at the time of this article, Mastodon has 73 unique SQL tables). If I
wanted to answer the question "which table gets the most SELECTs", I can drill down first by setting the **SQL Operation** to **select**,
after which I see decidedly fewer traces in the _SQL Table Operations_ graph. Further analysis shows that the tables read from
the most are **statuses** and **accounts**. When I drill down using the selectors at the top of Grafana's dashboard UI, the tail
latency is automatically filtered to only that which is selected. If I were to see very slow queries at some point in the future, it'll be
very easy to narrow down exactly which table and which operation is the culprit.
#### Cache Details
{{< image src="/assets/mastodon/mastodon-cache-details.png" alt="Mastodon Cache Details" >}}
For the cache statistics collector, I learn there are a few different operators. Similar to Postgres, I already have a really cool
[[Redis Dashboard](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)], for which I can see
the Redis _server_ view. But in Mastodon, I can now also see the _client_ view, and see when any of these operations spike in either
queries/sec (left graph), latency (middle graph), or tail latency for common operations (the dials on the right). This is bound to come in
handy at some point -- I already saw one or two spikes in the **generate** operation (see the blue spike in the screenshot above), which is
something to keep an eye on.
#### Sidekiq Details
{{< image src="/assets/mastodon/mastodon-sidekiq-details.png" alt="Mastodon Sidekiq Details" >}}
The single most
interesting thing in the Mastodon application is undoubtedly its _Sidekiq_ workers, the ones that do all sorts of system- and user-triggered
work such as distributing the posts to federated servers, prefetch links and media, and calculate trending tags, posts and links.
Sidekiq is a [[producer-consumer](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)] system where new units of work (called
jobs) are written to a _queue_ in Redis by a producer (typically Mastodon's webserver Puma, or another Sidekiq task that needs something to
happen at some point in the future), and then consumed by one or more pools which execute the _worker_ jobs.
There are several queues defined in Mastodon, and each _worker_ has a name, a _failure_ and _success_ rate, and a running tally of how much
_processing_time_ they've spent executing this type of work. Sidekiq workers will consume jobs in
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] order, and it has a finite amount of workers (by default on a small
instance it runs one worker with 25 threads). If you're interested in this type of provisioning, [[Nora
Tindall](https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/)] wrote a great article about it.
This drill-down dashboard shows all of the Sidekiq _worker_ types known to Prometheus, and can be selected at the top of the dashboard in
the dropdown called **Sidekiq Worker**. A total amount of worker jobs/second, as well as the running average time spent performing those jobs
is shown in the first two graphs. The three dials show the median, 90th percentile and 99th percentile latency of the work being performed.
If all threads are busy, new work is left in the queue, until a worker thread is available to execute the job. This will lead to a queue
delay on a busy server that is underprovisioned. For jobs that had to wait for an available thread to pick them up, the number of jobs per
queue, and the time in seconds that the jobs were waiting to be picked up by a worker, are shown in the two lists at bottom right.
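The queueing behaviour can be illustrated with a toy producer-consumer sketch. This is emphatically not Sidekiq, just a two-thread worker pool draining a FIFO queue, showing how queue delay appears once all threads are busy:

```python
import queue
import threading
import time

# Jobs carry their enqueue timestamp so the worker can measure queue delay.
jobs = queue.Queue()
delays = []

def worker():
    while True:
        enqueued_at = jobs.get()
        if enqueued_at is None:  # sentinel: shut the worker down
            break
        delays.append(time.monotonic() - enqueued_at)  # time spent waiting
        time.sleep(0.05)  # pretend the job takes 50ms to execute
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]  # 2 worker threads
for t in threads:
    t.start()
for _ in range(6):  # 6 jobs for 2 threads: the later jobs pile up in the queue
    jobs.put(time.monotonic())
jobs.join()
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()

# The first jobs are picked up immediately; the last ones waited in the queue.
print(sorted(round(d, 2) for d in delays))
```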
And with that, as [[North of the Border](https://www.youtube.com/@NorthoftheBorder)] would say: _"We're on to the glamour shots!"_.
## What's next
I made a promise on two references that will be needed to successfully hook up Prometheus and Grafana to the `STATSD_ADDR` configuration for
Mastodon's Rails environment, and here they are:
* The Statsd Exporter mapping configuration file: [[/etc/prometheus/statsd-mapping.yaml](/assets/mastodon/statsd-mapping.yaml)]
* The Grafana Dashboard: [[grafana.com/dashboards/](https://grafana.com/grafana/dashboards/17492-mastodon-stats/)]
**As a call to action:** if you are running a larger instance and would allow me to take a look and learn from you, I'd be very grateful.
I'm going to monitor my own instance for a little while, so that I can start to get a feeling for where the edges of performance cliffs are,
in other words: How slow is _too slow_? How much load is _too much_? In an upcoming post, I will take a closer look at alerting in Prometheus,
so that I can catch these performance cliffs and make human operators aware of them by means of alerts, delivered via Telegram or Slack.
By the way: If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit
more load / traffic on this instance will allow me to learn (and in turn, to share with others)!
---
date: "2022-12-05T11:56:54Z"
title: 'Review: S5648X-2Q4Z Switch - Part 1: VxLAN/GENEVE/NvGRE'
---
After receiving an e-mail from a newer [[China based switch OEM](https://starry-networks.com/)], I
had a chat with their founder and learned that the combination of switch silicon and software may be
a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup,
notably an in-depth review of the [[S5860-20SQ]({% post_url 2021-08-07-fs-switch %})] which sports
20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and
8x100G cages. I use them in production at IPng Networks and their featureset versus price point is
pretty good. In that article, I made one critical note reviewing those FS switches: they'd be an even
better fit if they allowed for MPLS or IP based L2VPN services in hardware.
{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}
{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}
<br /><br />
I got cautiously enthusiastic (albeit suitably skeptical) when this new vendor claimed VxLAN,
GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget (sub-$4K for the 56
port; and sub-$2K for the 26 port switch). This reseller is using a less known silicon vendor called
[[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet silicon. In this device,
the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired
with 4x100GbE uplink capability. This is Centec's fourth generation chip, and the CTC8096 inherits the
feature set of its predecessors, from L2/L3 switching through to advanced data center and metro
Ethernet features. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE
ports, with a variety of features including L2, L3, MPLS, VxLAN, MPLS SR, and OAM/APS. Highlight
features include telemetry, programmability, security and traffic management, and network time
synchronization.
This will be the first of a set of write-ups exploring the hard- and software functionality of this new
vendor. As we'll see, it's all about the _software_.
## Detailed findings
### Hardware
{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}
The switch comes well packaged with two removable 400W Gold powersupplies from _Compuware
Technology_ which output 12V/33A and +5V/3A as well as four removable PWM controlled fans from
_Protechnic_. The fans are expelling air, so they are cooling front-to-back on this unit. Looking at
the fans, changing them to pull air back-to-front would be possible after-sale, by flipping the fans
around as they're attached in their case by two M4 flat-head screws. This is truly meant to be
an OEM switch -- there is no logo or sticker with the vendor's name, so I should probably print a
few vinyl IPng stickers to skin them later.
{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-daughterboard.png" alt="S5648 Daughterboard" >}}
On the front, the switch sports an RJ45 standard serial console, a mini-usb connector of which the
function is not clear to me, an RJ45 network port used for management, a pinhole which houses a
reset button labeled `RST` and two LED indicators labeled `ID` and `SYS`. The serial port runs at
115200,8n1 and the management network port is Gigabit.
Regarding the regular switch ports, there are 48x SFP+ cages, 4x QSFP28 (port 49-52) running at
100Gbit, and 2x QSFP+ ports (53-54) running at 40Gbit. All ports (management and switch) present a
MAC address from OUI `00-1E-08`, which is assigned to Centec.
The switch is not particularly quiet: its six fans start up at a high pitch, but once the
switch boots, they calm down and emit noise levels as you would expect from a datacenter unit. I
measured it at 74dBA when booting, and otherwise at around 62dBA when running. On the inside, the
PCB is rather clean. It comes with a daughter board, housing a small PowerPC P1010 with 533MHz CPU,
1GB of RAM, and 2GB flash on board, which is running Linux. This is the same card that many of the
FS.com switches use (eg. S5860-48S6Q), a cheaper alternative to the high end Intel Xeon-D.
{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-switchchip-block.png" alt="S5648 Switchchip" >}}
#### S5648X (48x10, 2x40, 4x100)
There is one switch chip, on the front of the PCB, connecting all 54 ports. It has a sizable
heatsink on it, drawing air backwards through ports (36-48). The switch uses a less well known and
somewhat dated Centec [[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)],
codenamed GoldenGate and released in 2015, which is rated for 1.2Tbps of aggregate throughput. The
chip can be programmed to handle a bunch of SDN protocols, including VxLAN, GRE, GENEVE, and MPLS /
MPLS SR, with a limited TCAM to hold things like ACLs, IPv4/IPv6 routes and MPLS labels. The CTC8096
provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports. The SerDES design is
pretty flexible, allowing it to mix and match ports.
You can see more (hires) pictures and screenshots throughout these articles in this [[Photo
Album](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)].
#### S5624X (24x10, 2x100)
In case you're curious (I certainly was!) the smaller unit (with 24x10+2x100) is built around the
Centec [[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)], codenamed
TsingMa, which was released in 2019. It offers a similar set of features, including L2, L3, MPLS,
VxLAN, MPLS SR, and OAM/APS, with highlight features such as telemetry, programmability, security
and traffic management, and network time synchronization. The SoC has an embedded ARM A53 CPU core
running at 800MHz, and the SerDES on this chip allows for 24x1G/2.5G/5G/10G and 2x40G/100G for a
throughput of 440Gbps, at a fairly sharp price point.
One thing worth noting (because I know some of my regular readers will already be wondering!): this
series of chips (both the fourth generation CTC8096 and the sixth generation CTC7132) comes with a very
modest TCAM, which in practice means 32K MAC addresses, 8K IPv4 routes, 1K IPv6 routes, a 6K MPLS
table, 1K L2VPN instances, and 64 VPLS instances. The Centec comes as well with a modest 32MB packet
buffer shared between all ports, and the controlplane comes with 1GB of memory and a 533MHz ARM. So
no, this won't take a full table :-) but in all honesty, that's not the thing this machine is built
to do.
When booted, the switch draws roughly 68 Watts combined on its two power supplies, and I find that
pretty cool considering the total throughput offered. Of course, once optics are inserted, the total
power draw will go up. Also worth noting, when the switch is under load, the Centec chip will
consume more power, for example when forwarding 8x10G + 2x100G, the total consumption was 88 Watts,
totally respectable now that datacenter power bills are skyrocketing.
### Topology
{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}
On the heels of my [[DENOG14 Talk](/media/denog14/index.html)], in which I showed how VPP can
_route_ 150Mpps and 180Gbps on a 10 year old Dell while consuming a full 1M BGP table in 7 seconds
or so, I still had a little bit of a LAB left to repurpose. So I built the following topology using
the loadtester, packet analyzer, and switches:
* **msw-top**: S5624-2Z-EI switch
* **msw-core**: S5648X-2Q4ZA switch
* **msw-bottom**: S5624-2Z-EI switch
* All switches connect to:
* each other with 100G DACs (right, black)
* T-Rex machine with 4x10G (left, rainbow)
* Each switch gets a mgmt IPv4 and IPv6
With this topology I will have enough wiggle room to patch anything to anything. Now that the
physical part is out of the way, let's take a look at the firmware of these things!
### Software
As can be seen in the topology above, I am testing three of these switches - two are the smaller sibling
[S5624X 2Z-EI] (which come with 24x10G SFP+ and 2x100G QSFP28), and one is this [S5648X 2Q4Z]
pictured above. The vendor has a licensing system, for basic L2, basic L3 and advanced metro L3.
These switches come with the most advanced/liberal licenses, which means all of the features will
work on the switches, notably, MPLS/LDP and VPWS/VPLS.
Taking a look at the CLI, it's very Cisco IOS-esque; there's a few small differences, but the look
and feel is definitely familiar. Base configuration kind of looks like this:
#### Basic config
```
msw-core# show running-config
management ip address 192.168.1.33/24
management route add gateway 192.168.1.252
!
ntp server 216.239.35.4
ntp server 216.239.35.8
ntp mgmt-if enable
!
snmp-server enable
snmp-server system-contact noc@ipng.ch
snmp-server system-location Bruttisellen, Switzerland
snmp-server community public read-only
snmp-server version v2c
msw-core# conf t
msw-core(config)# stm prefer ipran
```
A few small things of note. There is no `mgmt0` device as I would've expected. Instead, the SoC
exposes its management interface to be configured with these `management ...` commands. The IPv4 can
be either DHCP or a static address, and IPv6 can only do static addresses. Only one (default)
gateway can be set for either protocol. Then, NTP can be set up to work on the `mgmt-if` which is a
useful way to use it for timekeeping.
The SNMP server works both from the `mgmt-if` and from the dataplane, which is nice.
SNMP supports everything you'd expect, including v3 and traps for all sorts of events, including
IPv6 targets and either dataplane or `mgmt-if`.
I did notice that the nameserver cannot use the `mgmt-if`, so I left it unconfigured. I found it a
little bit odd, considering all the other functionality does work just fine over the `mgmt-if`.
If you've run CAM-based systems before, you'll likely have come across some form of _partitioning_
mechanism, to allow certain types in the CAM (eg. IPv4, IPv6, L2 MACs, MPLS labels, ACLs) to have
more or fewer entries. This is particularly relevant on this switch because it has a comparatively
small CAM. It turns out, that by default MPLS is entirely disabled, and to turn it on (and sacrifice
some of that sweet sweet content addressable memory), I have to issue the command `stm prefer ipran`
(other flavors are _ipv6_, _layer3_, _ptn_, and _default_), and reload the switch.
Having been in the networking industry for a while, I scratched my head on the acronym **IPRAN**, so
I will admit having to look it up. It's a general term used to describe an IP based Radio Access
Network (2G, 3G, 4G or 5G) which uses IP as a transport layer technology. I find it funny in a
twisted sort of way, that to get the oldskool MPLS service, I have to turn on IPRAN.
Anyway, after changing the STM profile to _ipran_, the following partition is available:
**IPRAN** CAM | S5648X (msw-core) | S5624 (msw-top & msw-bottom)
---------------- | ---------------------------- | -----------------------------
MAC Addresses | 32k | 98k
IPv4 routes | host: 4k, indirect: 8k | host: 12k, indirect: 56k
IPv6 routes | host: 512, indirect: 512 | host: 2048, indirect: 1024
MPLS labels | 6656 | 6144
VPWS instances | 1024 | 1024
VPLS instances | 64 | 64
Port ACL entries | ingress: 1927, egress: 176 | ingress: 2976, egress: 928
VLAN ACL entries | ingress: 256, egress: 32 | ingress: 256, egress: 64
First off: there are quite a few differences here! The big switch holds relatively few MAC addresses
and IPv4/IPv6 routes compared to the little ones, but it has a few more MPLS labels. ACL wise, the
small switch once again has a bit more capacity. But, of course, the large switch has lots more ports
(56 versus 26), and is more expensive. Choose wisely :)
Regarding IPv4/IPv6 and MPLS space, luckily [[AS8298]({% post_url 2021-02-27-network %})] is
relatively compact in its IGP. As of today, it carries 41 IPv4 and 48 IPv6 prefixes in OSPF, which
means that these switches would be fine participating in Area 0. If CAM space does turn into an
issue down the line, I can put them in stub areas and advertise only a default. As an aside, VPP
doesn't have any CAM at all, so for my routers the table size is basically governed by system memory
(which on modern computers equals "infinite routes"). As long as I keep it out of the DFZ, this
switch should be fine, for example in a BGP-free core that switches traffic based on VxLAN or MPLS,
but I digress.
#### L2
First let's test a straight forward configuration:
```
msw-top# configure terminal
msw-top(config)# vlan database
msw-top(config-vlan)# vlan 5-8
msw-top(config-vlan)# interface eth-0-1
msw-top(config-if)# switchport access vlan 5
msw-top(config-vlan)# interface eth-0-2
msw-top(config-if)# switchport access vlan 6
msw-top(config-vlan)# interface eth-0-3
msw-top(config-if)# switchport access vlan 7
msw-top(config-vlan)# interface eth-0-4
msw-top(config-if)# switchport mode dot1q-tunnel
msw-top(config-if)# switchport dot1q-tunnel native vlan 8
msw-top(config-vlan)# interface eth-0-26
msw-top(config-if)# switchport mode trunk
msw-top(config-if)# switchport trunk allowed vlan only 5-8
```
By means of demonstration, I created port `eth-0-4` as a QinQ capable port - which means that any
untagged frames coming into it will become VLAN 8, but any tagged frames will become s-tag 8 and
c-tag with whatever tag was sent, in other words standard issue QinQ tunneling. The configuration
of `msw-bottom` is exactly the same, and because we're connecting these VLANs through `msw-core`,
I'll have to make it a member of all these interfaces using the `interface range` shortcut:
```
msw-core# configure terminal
msw-core(config)# vlan database
msw-core(config-vlan)# vlan 5-8
msw-core(config-vlan)# interface range eth-0-49 - 50
msw-core(config-if)# switchport mode trunk
msw-core(config-if)# switchport trunk allowed vlan only 5-8
```
The loadtest results in T-Rex are, quite unsurprisingly, line rate. In the screenshot below, I'm
sending 128 byte frames at 8x10G (40G from `msw-top` through `msw-core` and out `msw-bottom`, and
40G in the other direction):
{{< image src="/assets/oem-switch/l2-trex.png" alt="L2 T-Rex" >}}
A few notes, for critical observers:
* I have to use 128 byte frames because the T-Rex loadtester is armed with 3x Intel x710 NICs,
which have a total packet rate of 40Mpps only. Intel made these with LACP redundancy in mind,
and do not recommend fully loading them. As 64b frames would be ~59.52Mpps, the NIC won't keep
up. So, I let T-Rex send 128b frames, which is ~33.8Mpps.
* T-Rex shows only the first 4 ports in detail, and you can see all four ports are sending 10Gbps
   of L1 traffic, which at this frame size is 8.66Gbps of ethernet (as each frame also carries
   20 bytes of preamble and inter-frame gap overhead on the wire
   [[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). We can clearly see
though, that all Tx packets/sec are also Rx packets/sec, which means all traffic is safely
accounted for.
* In the top panel, you will see not 4x10, but **8x10Gbps and 67.62Mpps** of total throughput,
with no traffic lost, and the loadtester CPU well within limits: 👍
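The frame rate numbers in the first bullet can be double-checked with a quick calculation; each frame occupies its own length plus 20 bytes of preamble and inter-frame gap on the wire:

```python
# Line rate in packets/second for a given Ethernet frame size on a link.
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    return link_bps / ((frame_bytes + 20) * 8)  # +20B preamble/SFD + IFG

print(round(line_rate_pps(40e9, 64) / 1e6, 2))   # 59.52 Mpps at 64b frames
print(round(line_rate_pps(40e9, 128) / 1e6, 2))  # 33.78 Mpps at 128b frames
```

Which matches both the x710's ~59.52Mpps problem at 64b and the ~33.8Mpps T-Rex actually sends at 128b.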
```
msw-top# show int summary | exc DOWN
RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)
Interface Link RXBS RXPS TXBS TXPS
-----------------------------------------------------------------------------
eth-0-1 UP 10016060422 8459510 10016060652 8459510
eth-0-2 UP 10016080176 8459527 10016079835 8459526
eth-0-3 UP 10015294254 8458863 10015294258 8458863
eth-0-4 UP 10016083019 8459529 10016083126 8459529
eth-0-25 UP 449 0 501 0
eth-0-26 UP 41362394687 33837608 41362394527 33837608
```
Clearly, all three switches are happy to forward 40Gbps in both directions, and the 100G port is
happy to forward (at least) 40G symmetric. Because the uplink port is trunked, each ethernet
frame will be 4 bytes longer due to the dot1q tag, which at 128b frames means we'll be using
132/128 * 4 * 10G == 41.3G of traffic, which is spot on.
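For reference, the packet rate and trunk overhead arithmetic used above can be reproduced in a few
lines of Python. This is my own back-of-the-envelope, assuming the standard 20 bytes of L1 overhead
per frame (preamble, start-of-frame delimiter and inter-frame gap):

```python
def l1_rate_mpps(frame_bytes: int, link_gbps: float = 10.0) -> float:
    """Packets/sec (in millions) needed to saturate a link at a given
    frame size; each frame occupies 20 extra bytes on the wire."""
    return link_gbps * 1e9 / ((frame_bytes + 20) * 8) / 1e6

print(f"4x10G @  64b: {4 * l1_rate_mpps(64):.2f} Mpps")   # ~59.52, too much for the NICs
print(f"4x10G @ 128b: {4 * l1_rate_mpps(128):.2f} Mpps")  # ~33.78, comfortably within reach

# On the dot1q trunk, each 128b frame grows by a 4 byte VLAN tag:
print(f"trunk: {132 / 128 * 4 * 10:.2f} Gbps")            # ~41.25
```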
#### L3
For this test, I will reconfigure the 100G ports to become routed rather than switched. Remember,
`msw-top` connects to `msw-core`, which in turn connects to `msw-bottom`, so I'll need two IPv4 /31
and two IPv6 /64 transit networks. I'll also create a loopback interface with a stable IPv4 and IPv6
address on each switch, and I'll tie all of these together in IPv4 and IPv6 OSPF in Area 0. The
configuration for the `msw-top` switch becomes:
```
msw-top# configure terminal
interface loopback0
ip address 172.20.0.2/32
ipv6 address 2001:678:d78:400::2/128
ipv6 router ospf 8298 area 0
!
interface eth-0-26
description Core: msw-core eth-0-49
speed 100G
no switchport
mtu 9216
ip address 172.20.0.11/31
ipv6 address 2001:678:d78:400::2:2/112
ip ospf network point-to-point
ip ospf cost 1004
ipv6 ospf network point-to-point
ipv6 ospf cost 1006
ipv6 router ospf 8298 area 0
!
router ospf 8298
router-id 172.20.0.2
network 172.20.0.0/22 area 0
redistribute static
!
router ipv6 ospf 8298
router-id 172.20.0.2
redistribute static
```
Now that the IGP is up for IPv4 and IPv6 and I can ping the loopbacks from any switch to any other
switch, I can continue with the loadtest. I'll configure four IPv4 interfaces:
```
msw-top# configure terminal
interface eth-0-1
no switchport
ip address 100.65.1.1/30
!
interface eth-0-2
no switchport
ip address 100.65.2.1/30
!
interface eth-0-3
no switchport
ip address 100.65.3.1/30
!
interface eth-0-4
no switchport
ip address 100.65.4.1/30
!
ip route 16.0.1.0/24 100.65.1.2
ip route 16.0.2.0/24 100.65.2.2
ip route 16.0.3.0/24 100.65.3.2
ip route 16.0.4.0/24 100.65.4.2
```
After which I can see these transit networks and static routes propagate, through `msw-core`, and
into `msw-bottom`:
```
msw-bottom# show ip route
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
Dc - DHCP Client
[*] - [AD/Metric]
* - candidate default
O 16.0.1.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O 16.0.2.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O 16.0.3.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O 16.0.4.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56
O 100.65.1.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O 100.65.2.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O 100.65.3.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
O 100.65.4.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C 172.20.0.0/32 is directly connected, loopback0
O 172.20.0.1/32 [110/1005] via 172.20.0.9, eth-0-26, 05:50:48
O 172.20.0.2/32 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56
C 172.20.0.8/31 is directly connected, eth-0-26
C 172.20.0.8/32 is in local loopback, eth-0-26
O 172.20.0.10/31 [110/1018] via 172.20.0.9, eth-0-26, 05:50:48
```
I now instruct the T-Rex loadtester to send single-flow loadtest traffic from 16.0.X.1 -> 48.0.X.1
on port 0, and back from 48.0.X.1 -> 16.0.X.1 on port 1; for ports 2+3 I use X=2, for ports 4+5
X=3, and for ports 6+7 X=4. After T-Rex starts up, it's sending 80Gbps of
traffic with a grand total of 67.6Mpps in 8 unique flows of 8.45Mpps at 128b each, and the three
switches forward this L3 IPv4 unicast traffic effortlessly:
{{< image src="/assets/oem-switch/l3-trex.png" alt="L3 T-Rex" >}}
#### Overlay
What I've built just now would be acceptable really only if the switches were in the same rack (or
at best, facility). As an industry professional, I frown upon things like _VLAN-stretching_, a term
that describes bridging VLANs between buildings (or, as some might admit to .. between cities or
even countries🤮). A long time ago (in December 1999), Luca Martini invented what is now called [[Martini
Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)], defining how
to transport Ethernet frames over an MPLS network, which is what I really want to demonstrate, albeit
in the next article.
What folks don't always realize is that the industry is _moving on_ from MPLS to a set of more
flexible IP based solutions, notably tunneling using IPv4 or IPv6 UDP packets such as found in VxLAN
or GENEVE, two of my favorite protocols. This certainly does cost a little bit in VPP, as I wrote
about in my post on [[VLLs in VPP]({% post_url 2022-01-12-vpp-l2 %})], although you'd be surprised
how many VxLAN encapsulated packets/sec a simple AMD64 router can forward. With respect to these
switches, though, let's find out if tunneling this way incurs an overhead or performance penalty.
Ready? Let's go!
First I will put the first four interfaces in range `eth-0-1 - 4` into a new set of VLANs, but in the
VLAN database I will enable what is called _overlay_ on them:
```
msw-top# configure terminal
vlan database
vlan 5-8,10,20,30,40
vlan 10 name v-vxlan-xco10
vlan 10 overlay enable
vlan 20 name v-vxlan-xco20
vlan 20 overlay enable
vlan 30 name v-vxlan-xco30
vlan 30 overlay enable
vlan 40 name v-vxlan-xco40
vlan 40 overlay enable
!
interface eth-0-1
switchport access vlan 10
!
interface eth-0-2
switchport access vlan 20
!
interface eth-0-3
switchport access vlan 30
!
interface eth-0-4
switchport access vlan 40
```
Next, I create two new loopback interfaces (bear with me on this one), and configure the transport
of these overlays in the switch. This configuration will pick up the VLANs and move them to remote sites
in either VxLAN, GENEVE or NvGRE protocol, like this:
```
msw-top# configure terminal
!
interface loopback1
ip address 172.20.1.2/32
!
interface loopback2
ip address 172.20.2.2/32
!
overlay
remote-vtep 1 ip-address 172.20.0.0 type vxlan src-ip 172.20.0.2
remote-vtep 2 ip-address 172.20.1.0 type nvgre src-ip 172.20.1.2
remote-vtep 3 ip-address 172.20.2.0 type geneve src-ip 172.20.2.2 keep-vlan-tag
vlan 10 vni 829810
vlan 10 remote-vtep 1
vlan 20 vni 829820
vlan 20 remote-vtep 2
vlan 30 vni 829830
vlan 30 remote-vtep 3
vlan 40 vni 829840
vlan 40 remote-vtep 1
!
```
Alright, this is seriously cool! The first overlay defines what is called a remote `VTEP` (virtual
tunnel end point), of type `VxLAN` towards IPv4 address `172.20.0.0`, coming from source address
`172.20.0.2` (which is our `loopback0` interface on switch `msw-top`). As it turns out, I am not
allowed to create different overlay _types_ to the same _destination_ address, but not to worry: I
can create a few unique loopback interfaces with unique IPv4 addresses (see `loopback1` and
`loopback2`) and create new VTEPs using those. So, the VTEP at index 2 is of type `NvGRE` and the
one at index 3 is of type `GENEVE`. Due to the use of `keep-vlan-tag` on the latter, the
encapsulated traffic will carry dot1q tags, whereas in the other two VTEPs the tag will be stripped
and what is transported on the wire is untagged traffic.
```
msw-top# show vlan all
VLAN ID Name State STP ID Member ports
(u)-Untagged, (t)-Tagged
======= =============================== ======= ======= ========================
(...)
10 v-vxlan-xco10 ACTIVE 0 eth-0-1(u)
VxLAN: 172.20.0.2->172.20.0.0
20 v-vxlan-xco20 ACTIVE 0 eth-0-2(u)
NvGRE: 172.20.1.2->172.20.1.0
30 v-vxlan-xco30 ACTIVE 0 eth-0-3(u)
GENEVE: 172.20.2.2->172.20.2.0
40 v-vxlan-xco40 ACTIVE 0 eth-0-4(u)
VxLAN: 172.20.0.2->172.20.0.0
msw-top# show mac address-table
Mac Address Table
-------------------------------------------
(*) - Security Entry (M) - MLAG Entry
(MO) - MLAG Output Entry (MI) - MLAG Input Entry
(E) - EVPN Entry (EO) - EVPN Output Entry
(EI) - EVPN Input Entry
Vlan Mac Address Type Ports
---- ----------- -------- -----
10 6805.ca32.4595 dynamic VxLAN: 172.20.0.2->172.20.0.0
10 6805.ca32.4594 dynamic eth-0-1
20 6805.ca32.4596 dynamic NvGRE: 172.20.1.2->172.20.1.0
20 6805.ca32.4597 dynamic eth-0-2
30 9c69.b461.7679 dynamic GENEVE: 172.20.2.2->172.20.2.0
30 9c69.b461.7678 dynamic eth-0-3
40 9c69.b461.767a dynamic VxLAN: 172.20.0.2->172.20.0.0
40 9c69.b461.767b dynamic eth-0-4
```
Turning my attention to the VLAN database, the power of this becomes obvious. This
switch has any number of local interfaces either tagged or untagged (in the case of VLAN 10 we can
see `eth-0-1(u)` which means that interface is participating in the VLAN untagged), but we can also
see that this VLAN 10 has a member port called `VxLAN: 172.20.0.2->172.20.0.0`. This port is just
like any other, in that it'll participate in unknown unicast, broadcast and multicast, and "learn"
MAC addresses behind these virtual overlay ports. In VLAN 10 (and VLAN 40), I can see in the L2 FIB
(`show mac address-table`), that there's a local MAC address learned (from the T-Rex loadtester)
behind `eth-0-1`, but there's also a remote MAC address learned behind the VxLAN port. I'm
impressed.
I can add any number of VLANs (and dot1q-tunnels) into a VTEP endpoint, after assigning each of them
a unique `VNI` (virtual network identifier). If you're curious about these, take a look at the
[[VxLAN](https://datatracker.ietf.org/doc/html/rfc7348)], [[GENEVE](https://datatracker.ietf.org/doc/html/rfc8926)]
and [[NvGRE](https://www.rfc-editor.org/rfc/rfc7637.html)] specifications. Basically, the encapsulation is just putting the
ethernet frame as a payload of an UDP packet, so let's take a look at those.
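To make that concrete, here's a minimal sketch (my own illustration based on RFC 7348, not vendor
code) of the 8-byte VxLAN header that sits between the outer UDP header and the inner ethernet
frame:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """RFC 7348 VxLAN header: 8 bits of flags (0x08 = VNI present),
    24 bits reserved, a 24 bit VNI, and 8 more reserved bits."""
    assert 0 <= vni < 2**24
    return struct.pack("!II", 0x08 << 24, vni << 8)

hdr = vxlan_header(829810)   # the VNI assigned to VLAN 10 above
assert hdr == bytes.fromhex("080000000ca97200")
# A full encapsulated packet is then:
#   outer Ethernet / outer IPv4 / UDP (dst port 4789) / hdr / inner frame
```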
#### Inspecting overlay
As you'll recall, the VLAN 10,20,30,40 traffic is now traveling over an IP network, notably
encapsulated by the source switch `msw-top` and delivered to `msw-bottom` via IGP (in my case,
OSPF), while it transits through `msw-core`. I decide to take a look at this, by configuring a
monitor port on `msw-core`:
```
msw-core# show run | inc moni
monitor session 1 source interface eth-0-49 both
monitor session 1 destination interface eth-0-1
```
This will copy all in- and egress traffic from interface `eth-0-49` (connected to `msw-top`) through
to local interface `eth-0-1`, which is connected to the loadtester. I can simply tcpdump this stuff:
```
pim@trex01:~$ sudo tcpdump -ni eno2 '(proto gre) or (udp and port 4789) or (udp and port 6081)'
01:26:24.685666 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
(tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829810
68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
48.0.1.47.1025 > 16.0.1.47.12: UDP, length 82
01:26:24.688305 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype IPv4 (0x0800), length 166:
(tos 0x0, ttl 128, id 44814, offset 0, flags [DF], proto GRE (47), length 152)
172.20.1.2 > 172.20.1.0: GREv0, Flags [key present], key=0xca97c38, proto TEB (0x6558), length 132
68:05:ca:32:45:97 > 68:05:ca:32:45:96, ethertype IPv4 (0x0800), length 124:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
48.0.2.73.1025 > 16.0.2.73.12: UDP, length 82
01:26:24.689100 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 178:
(tos 0x0, ttl 127, id 7502, offset 0, flags [DF], proto UDP (17), length 164)
172.20.2.0.49208 > 172.20.2.2.6081: GENEVE, Flags [none], vni 0xca986, proto TEB (0x6558)
9c:69:b4:61:76:79 > 9c:69:b4:61:76:78, ethertype 802.1Q (0x8100), length 128: vlan 30, p 0, ethertype IPv4 (0x0800),
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
48.0.3.109.1025 > 16.0.3.109.12: UDP, length 82
01:26:24.701666 00:1e:08:0d:6e:89 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174:
(tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160)
172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829840
68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110)
48.0.4.47.1025 > 16.0.4.47.12: UDP, length 82
```
We can see packets for all four tunnels in this dump. The first one is a UDP packet to port 4789,
which is the standard port for VxLAN, and it has VNI 829810. The second packet is proto GRE with
flag `TEB`, which stands for _transparent ethernet bridge_: in other words, an L2 variant of GRE that
carries ethernet frames. The third one shows that feature I configured above (in case you forgot it,
it's the `keep-vlan-tag` option when creating the VTEP), and because of that flag we can see that
the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986` which is hex for
`829830`. Finally, the fourth one shows VLAN40 traffic that is sent to the same VTEP endpoint as
VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished
by VNI).
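As a quick sanity check: tcpdump decodes the VxLAN VNIs in decimal but the GENEVE one in hex, and
the numbers line up with the VNIs configured earlier:

```python
# The VNIs configured per VLAN earlier, checked against the dump above:
for vlan, vni in {10: 829810, 20: 829820, 30: 829830, 40: 829840}.items():
    print(f"VLAN {vlan}: VNI {vni} = {hex(vni)}")

assert hex(829830) == "0xca986"   # the GENEVE line in the tcpdump output
```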
{{< image width="90px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
At this point I make an important observation. VxLAN and GENEVE both have a really cool feature:
they can hash their _inner_ payload (ie. the IPv4/IPv6 addresses and ports, if available) and use
that hash to randomize the outer _source port_, which makes them preferable to GRE. This matters
because hashing turns unique inner flows into unique outer flows, which in turn allows them
to be loadbalanced in intermediate networks, and also in the receiver if it has multiple receive
queues. However (and this is important!), this switch does not hash, which means that all ethernet
traffic in the VxLAN, GENEVE and NvGRE tunnels always carries the exact same outer header, so
loadbalancing and multiple receive queues are out of the question. I wonder if this is a limitation
of the Centec chip, or a failure of the firmware to program or configure it.
With that gripe out of the way, let's take a look at 80Gbit of tunneled traffic, shall we?
{{< image src="/assets/oem-switch/overlay-trex.png" alt="Overlay T-Rex" >}}
Once again, all three switches are acing it. So at least 40Gbps of encap- and 40Gbps of decapsulation
per switch, and the transport over IPv4 through the `msw-core` switch to the other side, is all in
working order. On top of that, I've shown that multiple types of overlay can live alongside one
another, even between the same pair of switches, and that multiple VLANs can share the same underlay
transport. The only downside is the **single flow** nature of these UDP transports.
A final inspection of the switch throughput:
```
msw-top# show interface summary | exc DOWN
RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)
Interface Link RXBS RXPS TXBS TXPS
-----------------------------------------------------------------------------
eth-0-1 UP 10013004482 8456929 10013004548 8456929
eth-0-2 UP 10013030687 8456951 10013030801 8456951
eth-0-3 UP 10012625863 8456609 10012626030 8456609
eth-0-4 UP 10013032737 8456953 10013034423 8456954
eth-0-25 UP 505 0 513 0
eth-0-26 UP 51147539721 33827761 51147540111 33827762
```
Take a look at that `eth-0-26` interface: it's using significantly more bandwidth (51Gbps) than
the sum of the four transports (4x10Gbps). This is because each ethernet frame (of 128b) has to be
wrapped in an IPv4 UDP packet (or in the case of NvGRE an IPv4 packet with a GRE header), which
incurs quite some overhead, for small packets at least. But it definitely proves that the switches
here are happy to do this forwarding at line rate, and that's what counts!
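Where do those extra bytes come from? A quick sketch of my own accounting, based on the header
sizes in the respective RFCs, shows the per-frame encapsulation cost of each overlay type and
matches the frame sizes seen in the tcpdump earlier:

```python
ETH, IPV4, UDP, VXLAN, GENEVE, GRE_WITH_KEY = 14, 20, 8, 8, 8, 8

overhead = {
    "vxlan":  ETH + IPV4 + UDP + VXLAN,    # 50 bytes
    "geneve": ETH + IPV4 + UDP + GENEVE,   # 50 bytes (without options)
    "nvgre":  ETH + IPV4 + GRE_WITH_KEY,   # 42 bytes (4B GRE base + 4B key)
}
# Matches the tcpdump above: a 124 byte inner frame became 174 (VxLAN),
# 166 (NvGRE) and 178 (GENEVE, whose inner frame kept its 4 byte dot1q tag).
assert 124 + overhead["vxlan"] == 174
assert 124 + overhead["nvgre"] == 166
assert 124 + 4 + overhead["geneve"] == 178
print({k: f"+{v}B per frame" for k, v in overhead.items()})
```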
### Conclusions
It's just super cool to see a switch like this work as expected. I did not manage to overload it at
all, neither with an IPv4 loadtest at 67Mpps and 80Gbit of traffic, nor with an L2 loadtest with
four ports transported over VxLAN, NvGRE and GENEVE _at the same time_. Although the underlay can
only use IPv4 (no IPv6 is available in the switch chip), this is not a huge problem for me. At
AS8298, I can easily define a private VRF with IPv4 space from RFC1918 to transport this traffic
over VxLAN. And what's even better: this can inter-operate perfectly with my VPP routers, which
also do VxLAN en/decapsulation.
Now there is one more thing for me to test (and, cliffhanger, I've tested it already but I'll have
to write up all of my data and results ...). I need to do what I said I would do in the beginning of
this article, and what I had hoped to achieve with the FS switches but couldn't due to lack of
support: MPLS L2VPN transport (and its more complex but cooler sibling, VPLS).
---
date: "2022-12-09T11:56:54Z"
title: 'Review: S5648X-2Q4Z Switch - Part 2: MPLS'
---
After receiving an e-mail from a newer [[China based OEM](https://starry-networks.com)], I had a chat with their
founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.
I got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and
line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port
switch). This reseller is using a less known silicon vendor called
[[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet silicon. In this device,
the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired
with 4x100GbE uplink capability. The CTC8096 is Centec's fourth generation chip, extending the
feature set from L2/L3 switching to advanced datacenter and metro Ethernet features. The switch
chip provides up to 96x10GbE, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its
predecessors a variety of features including L2, L3, MPLS, VxLAN, MPLS SR, and OAM/APS. Highlight
features include telemetry, programmability, security, traffic management, and network time
synchronization.
{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}
{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}
<br /><br />
After discussing basic L2, L3 and Overlay functionality in my [[previous post]({% post_url
2022-12-05-oem-switch-1 %})], I left somewhat of a cliffhanger alluding to all this fancy MPLS and
VPLS stuff. Honestly, I needed a bit more time to play around with the featureset and clarify a few things.
I'm now ready to assert that this stuff is really possible on this switch, and if this tickles your fancy, by
all means read on :)
## Detailed findings
### Hardware
{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}
The switch comes well packaged, with two removable 400W Gold power supplies from _Compuware
Technology_ (which output 12V/33A and +5V/3A) as well as four removable PWM controlled fans from
_Protechnic_. The switch chip is a Centec
[[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)] which is a competent silicon unit that can offer
48x10, 2x40 and 4x100G, and its smaller sibling carries the newer
[[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)] from 2019, which brings
24x10 and 2x100G connectivity. While the firmware differs slightly in denomination (the large
one shows `NetworkOS-e580-v7.4.4.r.bin` as its firmware, and the smaller one shows
`uImage-v7.0.4.40.bin`), I get the impression that the latter is a compiled-down version of the
former, made to work with the newer chipset.
In my [[previous post]({% post_url 2022-12-05-oem-switch-1 %})], I showed L2, L3 and VxLAN, GENEVE
and NvGRE capabilities of this switch to be line rate. But the hardware also supports MPLS, so I
figured I'd complete the Overlay series by exploring VxLAN, and the MPLS, EoMPLS (L2VPN, Martini
style), and VPLS functionality of these units.
### Topology
{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}
In the [[IPng Networks LAB]({% post_url 2022-10-14-lab-1 %})], I build the following topology using
the loadtester, packet analyzer, and switches:
* **msw-top**: S5624-2Z-EI switch
* **msw-core**: S5648X-2Q4ZA switch
* **msw-bottom**: S5624-2Z-EI switch
* All switches connect to:
* each other with 100G DACs (right, black)
* T-Rex machine with 4x10G (left, rainbow)
* Each switch gets a mgmt IPv4 and IPv6
This is the same topology as in the previous post, and it gives me lots of wiggle room to patch anything
to anything as I build point to point MPLS tunnels, VPLS clouds and eVPN overlays. Although I will
also load/stress test these configurations, this post is more about the higher level configuration
work that goes into building such an MPLS enabled telco network.
### MPLS
Why even bother, if we have these fancy new IP based transports that I [[wrote about]({% post_url
2022-12-05-oem-switch-1 %})] last week? I mentioned that the industry is _moving on_ from MPLS
to a set of more flexible IP based solutions like VxLAN and GENEVE, as they certainly offer lots of
benefits in deployment (notably as overlays on top of existing IP networks).
Here's one plausible answer: you may have come across an architectural network design concept known
as a [[BGP Free Core](https://bgphelp.com/2017/02/12/bgp-free-core/)]. Operating this way leaves
very little room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it's
relatively simple in design and implementation. Some advantages worth mentioning:
* Transport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either
in the RIB or FIB, allowing them to be much cheaper.
* As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU
utilization during massive BGP re-convergence.
* Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public
internet exchange, to take two common examples) can be eliminated. If a new BGP security
vulnerability were to be discovered, transport devices aren't impacted.
* Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and
the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated.
* New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and
eVPN can all be introduced without modifying the routing core.
If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet,
making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities
for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers
between locations, rackspace, power, cross connects).
For smaller clubs (like IPng Networks), being able to share a 100G wave with others significantly
reduces the price per Megabit! So if you're in Zurich, Switzerland, or elsewhere in Europe, and find
this an interesting avenue to expand your reach in a co-op style environment, [[reach out](/s/contact)]
to us, any time!
#### MPLS + LDP Configuration
OK, let's talk bits and bytes. Table stakes functionality is of course MPLS switching and label distribution,
which is performed with LDP, described in [[RFC3036](https://www.rfc-editor.org/rfc/rfc3036.html)].
Enabling these features is relatively straightforward:
```
msw-top# show run int loop0
interface loopback0
ip address 172.20.0.2/32
ipv6 address 2001:678:d78:400::2/128
ipv6 router ospf 8298 area 0
msw-top# show run int eth-0-25
interface eth-0-25
description Core: msw-bottom eth-0-25
speed 100G
no switchport
mtu 9216
label-switching
ip address 172.20.0.12/31
ipv6 address 2001:678:d78:400::3:1/112
ip ospf network point-to-point
ip ospf cost 104
ipv6 ospf network point-to-point
ipv6 ospf cost 106
ipv6 router ospf 8298 area 0
enable-ldp
msw-top# show run router ospf
router ospf 8298
network 172.20.0.0/24 area 0
msw-top# show run router ipv6 ospf
router ipv6 ospf 8298
router-id 172.20.0.2
msw-top# show run router ldp
router ldp
router-id 172.20.0.2
transport-address 172.20.0.2
```
This seems like a mouthful, but really not too complicated. From the top, I create a loopback
interface with an IPv4 (/32) and IPv6 (/128) address. Then, on the 100G transport interfaces, I
specify an IPv4 (/31, let's not be wasteful, take a look at [[RFC
3021](https://www.rfc-editor.org/rfc/rfc3021.html)]) and IPv6 (/112) transit network, after which I
add the interface to OSPF and OSPFv3.
The two main things to note in the interface definition are the use of `label-switching`, which
enables MPLS on the interface, and `enable-ldp`, which makes it periodically multicast LDP
discovery packets. If another device is also doing that, an LDP _adjacency_ is formed using a TCP session.
The two devices then exchange MPLS label tables, so that they learn from each other how to switch
MPLS packets across the network.
LDP _signalling_ kind of looks like this on the wire:
```
14:21:43.741089 IP 172.20.0.12.646 > 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.2:0, pdu-length: 30
14:21:44.331613 IP 172.20.0.13.646 > 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.1:0, pdu-length: 30
14:21:44.332773 IP 172.20.0.2.36475 > 172.20.0.1.646: Flags [S],seq 195175, win 27528,
options [mss 9176,sackOK,TS val 104349486 ecr 0,nop,wscale 7], length 0
14:21:44.333700 IP 172.20.0.1.646 > 172.20.0.2.36475: Flags [S.], seq 466968, ack 195176, win 18328,
options [mss 9176,sackOK,TS val 104335979 ecr 104349486,nop,wscale 7], length 0
14:21:44.334313 IP 172.20.0.2.36475 > 172.20.0.1.646: Flags [.], ack 1, win 216,
options [nop,nop,TS val 104349486 ecr 104335979], length 0
```
The first two packets here are the routers announcing themselves to the [[well known
multicast](https://en.wikipedia.org/wiki/Multicast_address)] address for _all-routers_ (224.0.0.2),
on well known port 646 (for LDP), in a packet called a _Hello Message_. The router with
address 172.20.0.12 is the one we just configured (`msw-top`), and the one with address 172.20.0.13 is
the other side (`msw-bottom`). In these _Hello messages_, the router informs multicast listeners
where they should connect (called the _IPv4 transport address_), in the case of `msw-top`, it's
172.20.0.2.
Now that they've noticed one another's willingness to form an adjacency, a TCP connection is
initiated from our router's loopback address (specified by `transport-address` in the LDP
configuration), towards the loopback that was learned from the _Hello Message_ in the multicast
packet earlier. A TCP three way handshake follows, in which the routers also tell each other their
MTU (by means of the MSS field set to 9176, which is 9216 minus 20 bytes [[IPv4
header](https://en.wikipedia.org/wiki/Internet_Protocol_version_4)] and 20 bytes [[TCP
header](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)]). The adjacency forms and both
routers exchange label information (in things called a _Label Mapping Message_). Once done
exchanging this info, `msw-top` can now switch MPLS packets across its two 100G interfaces.
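As a small aside, the MSS advertised in that three way handshake falls straight out of the
configured jumbo MTU:

```python
MTU, IPV4_HDR, TCP_HDR = 9216, 20, 20
mss = MTU - IPV4_HDR - TCP_HDR
print(f"MSS: {mss}")   # 9176, as seen in the SYN options above
```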
Zooming back out from what happened on the wire with the LDP _signalling_, I can take a look at the
`msw-top` switch: besides the adjacency that I described in detail above, another one has formed over
the IPv4 transit network between `msw-top` and `msw-core` (refer to the topology diagram to see what
connects where). As this is a layer3 network, icky things like spanning tree and forwarding loops
are no longer an issue. Any switch can forward MPLS packets to any neighbor in this topology; path
preference is informed by the OSPF costs on the IPv4 interfaces (because LDP is using IPv4 here).
```
msw-top# show ldp adjacency
IP Address Intf Name Holdtime LDP-Identifier
172.20.0.10 eth-0-26 15 172.20.0.1:0
172.20.0.13 eth-0-25 15 172.20.0.0:0
msw-top# show ldp session
Peer IP Address IF Name My Role State KeepAlive
172.20.0.0 eth-0-25 Active OPERATIONAL 30
172.20.0.1 eth-0-26 Active OPERATIONAL 30
```
#### MPLS pseudowire
The easiest (and possibly most widely used) form is to create a point to point ethernet link
between an interface on one switch, through the MPLS network, and into another switch's interface on
the other side. Think of this as a really long network cable. Ethernet frames are encapsulated in
MPLS, and passed through the network through some sort of tunnel, called a _pseudowire_.
There are many names for this tunneling technique. Folks refer to them as PWs (PseudoWires), VLLs
(Virtual Leased Lines), Carrier Ethernet, or Metro Ethernet. Luckily, these are almost always
interoperable, because under the covers, the vendors are implementing these MPLS cross connect
circuits using [[Martini Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)]
which were formalized in [[RFC 4447](https://datatracker.ietf.org/doc/html/rfc4447)].
The way Martini tunnels work is by creating an extension in LDP signalling. An MPLS label-switched-path is
annotated as being of a certain type, carrying a 32 bit _pseudowire ID_, which is ignored by all
intermediate routers (they will just switch the MPLS packet onto the next hop), but the last router
will inspect the MPLS packet and find which _pseudowire ID_ it belongs to, and look up in its local
table what to do with it (mostly just unwrap the MPLS packet, and marshall the resulting ethernet
frame into an interface or tagged sub-interface).
Configuring the _pseudowire_ is really simple:
```
msw-top# configure terminal
interface eth-0-1
mpls-l2-circuit pw-vll1 ethernet
!
mpls l2-circuit pw-vll1 829800 172.20.0.0 raw mtu 9000
msw-top# show ldp mpls-l2-circuit 829800
Transport Client VC Trans Local Remote Destination
VC ID Binding State Type VC Label VC Label Address
829800 eth-0-1 UP Ethernet 32774 32773 172.20.0.0
```
After I've configured this on both `msw-top` and `msw-bottom`, a new LSP will be set up using LDP
signalling, which carries ethernet frames of up to 9000 bytes, encapsulated in MPLS, over the network. To
show this in more detail, I'll take the two ethernet interfaces that are connected to `msw-top:eth-0-1`
and `msw-bottom:eth-0-1`, and move them in their own network namespace on the lab machine:
```
root@dut-lab:~# ip netns add top
root@dut-lab:~# ip netns add bottom
root@dut-lab:~# ip link set netns top enp66s0f0
root@dut-lab:~# ip link set netns bottom enp66s0f1
```
I can now enter the _top_ and _bottom_ namespaces, and play around with those interfaces, for example
I'll give them an IPv4 address and a sub-interface with dot1q tag 1234 and an IPv6 address:
```
root@dut-lab:~# nsenter --net=/var/run/netns/bottom
root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1
root@dut-lab:~# ip link add link enp66s0f1 name v1234 type vlan id 1234
root@dut-lab:~# ip addr add 2001:db8::2/64 dev v1234
root@dut-lab:~# ip link set v1234 up
root@dut-lab:~# nsenter --net=/var/run/netns/top
root@dut-lab:~# ip addr add 192.0.2.0/31 dev enp66s0f0
root@dut-lab:~# ip link add link enp66s0f0 name v1234 type vlan id 1234
root@dut-lab:~# ip addr add 2001:db8::1/64 dev v1234
root@dut-lab:~# ip link set v1234 up
root@dut-lab:~# ping -c 5 2001:db8::2
PING 2001:db8::2(2001:db8::2) 56 data bytes
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.158 ms
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.155 ms
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.162 ms
```
The `mpls-l2-circuit` that I created will transport the received ethernet frames between `enp66s0f0`
(in the _top_ namespace) and `enp66s0f1` (in the _bottom_ namespace), using MPLS encapsulation, and
giving the packets a stack of _two_ labels. The outermost label helps the switches determine where
to switch the MPLS packet (in other words, route it from `msw-top` to `msw-bottom`). Once the
destination is reached, the outer label is popped off the stack, to reveal the second label, the
purpose of which is to tell the `msw-bottom` switch what, precisely, to do with this payload. The
switch will find that the second label instructs it to transmit the MPLS payload as an ethernet
frame out on port `eth-0-1`.
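As a toy model (my own sketch, nothing to do with the actual switch firmware), the two-label
dispatch can be written down like this, with label values borrowed from the `show ldp` output above:

```python
def handle_mpls(label_stack, transit_labels, pw_labels):
    """Toy model of the two-label dispatch described above.
    transit_labels: outer label -> next hop (route the packet onward)
    pw_labels:      inner label -> egress port (emit payload as ethernet)"""
    label = label_stack.pop(0)
    if label in transit_labels:
        return ("switch-to", transit_labels[label], label_stack)
    if label in pw_labels:
        return ("emit-ethernet", pw_labels[label], label_stack)
    return ("drop", None, label_stack)

# Outer label routes towards msw-bottom; the revealed inner label 32773
# corresponds to the VC label bound to eth-0-1 in the 'show ldp' output.
transit = {32768: "msw-bottom"}
pws = {32773: "eth-0-1"}
action, where, rest = handle_mpls([32768, 32773], transit, pws)
assert (action, where) == ("switch-to", "msw-bottom")
action, where, rest = handle_mpls(rest, transit, pws)
assert (action, where) == ("emit-ethernet", "eth-0-1")
```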
If I want to look at what happens on the wire with tcpdump(8), I can use the monitor port on
`msw-core` which mirrors all packets transiting through it. But, I don't get very far:
```
root@dut-lab:~# tcpdump -evni eno2 mpls
19:57:37.055854 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype MPLS unicast (0x8847), length 144:
MPLS (label 32768, exp 0, ttl 255) (label 32773, exp 0, [S], ttl 255)
0x0000: 9c69 b461 7679 9c69 b461 7678 8100 04d2 .i.avy.i.avx....
0x0010: 86dd 6003 4a42 0040 3a40 2001 0db8 0000 ..`.JB.@:@......
0x0020: 0000 0000 0000 0000 0001 2001 0db8 0000 ................
0x0030: 0000 0000 0000 0000 0002 8000 3553 9326 ............5S.&
0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c..........
0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................
0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!"#$%&'()*+,-
0x0070: 2e2f 3031 3233 3435 3637 ./01234567
19:57:37.055890 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype MPLS unicast (0x8847), length 140:
MPLS (label 32774, exp 0, [S], ttl 254)
0x0000: 9c69 b461 7678 9c69 b461 7679 8100 04d2 .i.avx.i.avy....
0x0010: 86dd 6009 4122 0040 3a40 2001 0db8 0000 ..`.A".@:@......
0x0020: 0000 0000 0000 0000 0002 2001 0db8 0000 ................
0x0030: 0000 0000 0000 0000 0001 8100 3453 9326 ............4S.&
0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c..........
0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................
0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!"#$%&'()*+,-
0x0070: 2e2f 3031 3233 3435 3637 ./01234567
```
For a brief moment, I stare closely at the first part of the hex dump, and I recognize two MAC addresses
`9c69.b461.7678` and `9c69.b461.7679` followed by what appears to be `0x8100` (the ethertype for
[[Dot1Q](https://en.wikipedia.org/wiki/IEEE_802.1Q)]) and then `0x04d2` (which is 1234 in decimal,
the VLAN tag I chose).
Clearly, the hexdump here is "just" an ethernet frame. So why doesn't tcpdump decode it? The answer is simple:
nothing in the MPLS packet tells me that the payload is actually ethernet. It could be anything, and
it's really up to the recipient of the packet with the label 32773 to determine what its payload means.
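To convince myself, I can pick the bytes apart by hand. This short Python sketch (my own
illustration, not part of any switch tooling) decodes the label stack entry of the second packet
above, and then the start of the inner ethernet frame straight from the hexdump:

```python
import struct

def parse_mpls_entry(word: int):
    """Split a 32-bit MPLS label stack entry into its fields."""
    label = word >> 12          # 20-bit label
    exp = (word >> 9) & 0x7     # 3-bit traffic class (exp)
    bos = (word >> 8) & 0x1     # bottom-of-stack bit ([S] in tcpdump)
    ttl = word & 0xFF           # 8-bit TTL
    return label, exp, bos, ttl

# Second packet: tcpdump decoded label 32774, exp 0, [S] set, ttl 254
assert parse_mpls_entry(0x080061FE) == (32774, 0, 1, 254)

# The first 16 bytes of the MPLS payload, copied from the hexdump:
payload = bytes.fromhex("9c69b4617678" "9c69b4617679" "8100" "04d2")
dst, src = payload[0:6], payload[6:12]
ethertype, tci = struct.unpack("!HH", payload[12:16])
assert ethertype == 0x8100     # a Dot1Q tag follows
assert tci & 0x0FFF == 1234    # the VLAN ID I configured
```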
Luckily, Wireshark can be prompted to decode further based on which MPLS label is present. Using the
_Decode As..._ option, I can specify that data following label 32773 is _Ethernet PW (no CW)_, where
PW here means _pseudowire_ and CW means _controlword_. Et, voil&agrave;, the first packet reveals itself:
{{< image src="/assets/oem-switch/mpls-wireshark-1.png" alt="MPLS Frame #1 dissected" >}}
#### Pseudowires on Sub Interfaces
One very common use case for me at IPng Networks is to work with excellent partners like
[[IP-Max](https://www.ip-max.net/)] who provide Internet Exchange transport, for example from DE-CIX
or SwissIX, to the customer premises. IP-Max uses Cisco's ASR9k routers, an absolutely beautiful piece
of technology [[ref]({% post_url 2022-02-21-asr9006 %})], and with those you can terminate an _L2VPN_ on
any sub-interface.
Let's configure something similar. I take one port on `msw-top`, and branch that out into three
remote locations, in this case `msw-bottom` port 1, 2 and 3. I will be terminating all three _pseudowires_
on the same endpoint, but obviously this could also be one port that goes to three internet exchanges,
say SwissIX, DE-CIX and FranceIX, on three different endpoints.
The configuration for both switches will look like this:
```
msw-top# configure terminal
interface eth-0-1
switchport mode trunk
switchport trunk native vlan 5
switchport trunk allowed vlan add 6-8
mpls-l2-circuit pw-vlan10 vlan 10
mpls-l2-circuit pw-vlan20 vlan 20
mpls-l2-circuit pw-vlan30 vlan 30
mpls l2-circuit pw-vlan10 829810 172.20.0.0 raw mtu 9000
mpls l2-circuit pw-vlan20 829820 172.20.0.0 raw mtu 9000
mpls l2-circuit pw-vlan30 829830 172.20.0.0 raw mtu 9000
msw-bottom# configure terminal
interface eth-0-1
mpls-l2-circuit pw-vlan10 ethernet
interface eth-0-2
mpls-l2-circuit pw-vlan20 ethernet
interface eth-0-3
mpls-l2-circuit pw-vlan30 ethernet
mpls l2-circuit pw-vlan10 829810 172.20.0.2 raw mtu 9000
mpls l2-circuit pw-vlan20 829820 172.20.0.2 raw mtu 9000
mpls l2-circuit pw-vlan30 829830 172.20.0.2 raw mtu 9000
```
Previously, I configured the port in _ethernet_ mode, which takes all frames and forwards them into
the MPLS tunnel. In this case, I'm using _vlan_ mode, specifying a VLAN tag: frames arriving
on the port that match it will selectively be put into a pseudowire. As an added benefit, this allows
me to still use the port as a regular switchport: in the snippet above it will take untagged frames
and assign them to VLAN 5, allow tagged frames with dot1q VLAN tag 6, 7 or 8, and handle them as any
normal switch would. VLAN tag 10, however, is directed into the pseudowire called _pw-vlan10_, and
the other two tags similarly get put into their own `l2-circuit`. Using LDP signalling, the _pw-id_
(829810, 829820, and 829830) determines which label is assigned. On the way back, that label
allows the switch to correlate the ethernet frame with the correct port and transmit it with the
configured VLAN tag.
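The ingress classification on `msw-top:eth-0-1` can be sketched in a few lines of Python (a toy
illustration of the config above; the function name and the drop behavior for disallowed tags are
my assumptions):

```python
# Ingress classification for msw-top:eth-0-1, as configured above.
# VLAN tags 10/20/30 are stolen by pseudowires; everything else is
# handled by the regular trunk-port switching logic.
PW_MAP = {10: "pw-vlan10", 20: "pw-vlan20", 30: "pw-vlan30"}
NATIVE_VLAN = 5
ALLOWED_TAGS = {6, 7, 8}

def classify(vlan_tag=None):
    if vlan_tag in PW_MAP:
        return ("pseudowire", PW_MAP[vlan_tag])  # into the MPLS network
    if vlan_tag is None:
        return ("switch", NATIVE_VLAN)           # untagged -> native VLAN
    if vlan_tag in ALLOWED_TAGS:
        return ("switch", vlan_tag)              # locally switched
    return ("drop", None)                        # not allowed on this trunk

assert classify(30) == ("pseudowire", "pw-vlan30")
assert classify() == ("switch", 5)
assert classify(7) == ("switch", 7)
assert classify(99) == ("drop", None)
```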
To show this from an end-user point of view, let's take a look at the Linux server connected to these
switches. I'll put one port in a namespace called _top_, and three other ports in a network namespace
called _bottom_, and then proceed to give them a little bit of config:
```
root@dut-lab:~# ip link set netns top dev enp66s0f0
root@dut-lab:~# ip link set netns bottom dev enp66s0f1
root@dut-lab:~# ip link set netns bottom dev enp66s0f2
root@dut-lab:~# ip link set netns bottom dev enp4s0f1
root@dut-lab:~# nsenter --net=/var/run/netns/top
root@dut-lab:~# ip link add link enp66s0f0 name v10 type vlan id 10
root@dut-lab:~# ip link add link enp66s0f0 name v20 type vlan id 20
root@dut-lab:~# ip link add link enp66s0f0 name v30 type vlan id 30
root@dut-lab:~# ip addr add 192.0.2.0/31 dev v10
root@dut-lab:~# ip addr add 192.0.2.2/31 dev v20
root@dut-lab:~# ip addr add 192.0.2.4/31 dev v30
root@dut-lab:~# nsenter --net=/var/run/netns/bottom
root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1
root@dut-lab:~# ip addr add 192.0.2.3/31 dev enp66s0f2
root@dut-lab:~# ip addr add 192.0.2.5/31 dev enp4s0f1
root@dut-lab:~# ping 192.0.2.4
PING 192.0.2.4 (192.0.2.4) 56(84) bytes of data.
64 bytes from 192.0.2.4: icmp_seq=1 ttl=64 time=0.153 ms
64 bytes from 192.0.2.4: icmp_seq=2 ttl=64 time=0.209 ms
```
To unpack this a little bit, in the first block I assign the interfaces to their respective
namespace. Then, for the interface connected to the `msw-top` switch, I create three dot1q
sub-interfaces, corresponding to the pseudowires I created. Note: untagged traffic out of
`enp66s0f0` will simply be picked up by the switch and assigned VLAN 5 (and I'm also allowed to send
VLAN tags 6, 7 and 8, which will all be handled locally).
But, VLAN 10, 20 and 30 will be moved through the MPLS network and pop out on the `msw-bottom`
switch, where they are each assigned a unique port, represented by `enp66s0f1`, `enp66s0f2` and
`enp4s0f1` connected to the bottom switch.
When I finally ping 192.0.2.4, that ICMP packet goes out on `enp4s0f1` and enters
`msw-bottom:eth-0-3`, where it is assigned the pseudowire name _pw-vlan30_, which corresponds to
_pw-id_ 829830. It then travels over the MPLS network, arriving at `msw-top` carrying a label
that tells that switch that it belongs to its local _pw-id_ 829830, which corresponds to the name
_pw-vlan30_ and is emitted with VLAN tag 30 on port `eth-0-1`. Phew, I made it. It actually makes sense
when you think about it!
#### VPLS
The _pseudowires_ that I described in the previous section are simply ethernet cross connects
spanning over an MPLS network. They are inherently point-to-point, much like a physical Ethernet
cable is. Sometimes, it makes more sense to take a local port and create what is called a _Virtual
Private LAN Service_ (VPLS), described in [[RFC4762]](https://www.rfc-editor.org/rfc/rfc4762.html),
where packets into this port are capable of being sent to any number of other ports on any number of
other switches, while using MPLS as transport.
By means of example, let's say a telco offers me one port in Amsterdam, one in Zurich and one in
Frankfurt. A VPLS instance would create an emulated LAN segment between these locations, in other
words a Layer 2 broadcast domain that is fully capable of learning and forwarding on Ethernet MAC
addresses but the ports are dedicated to me, and they are isolated from other customers. The telco
has essentially created a three-port switch for me, but at the same time, that telco can create
any number of VPLS services, each one unique to their individual customers. It's a pretty powerful
concept.
In principle, a VPLS consists of two parts:
1. A full mesh of simple MPLS point-to-point tunnels from each participating switch to
each other one. These are just _pseudowires_ with a given _pw-id_, just like I showed before.
1. The _pseudowires_ are then tied together in a form of bridge domain, and MAC learning is applied,
so that each switch knows which addresses are reachable behind each of its ports.
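The second part is the classic flood-and-learn algorithm that any ethernet bridge runs, applied
uniformly to local ports and mesh _pseudowires_. A minimal sketch (purely illustrative, not how the
switch ASIC implements it):

```python
class VplsInstance:
    """Toy flood-and-learn bridge over a set of 'ports', treating local
    attachment circuits and pseudowire mesh peers alike."""
    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}                      # MAC -> port it was learned on

    def forward(self, src_mac, dst_mac, in_port):
        self.fdb[src_mac] = in_port        # learn the source address
        out = self.fdb.get(dst_mac)
        if out is not None and out != in_port:
            return {out}                   # known unicast: one port
        return self.ports - {in_port}      # unknown/broadcast: flood

v = VplsInstance({"eth-0-1", "pw:172.20.0.0", "pw:172.20.0.1"})
# First frame from A: destination unknown, so it floods...
assert v.forward("mac-A", "mac-B", "eth-0-1") == {"pw:172.20.0.0", "pw:172.20.0.1"}
# ...but the reply towards A is now known unicast.
assert v.forward("mac-B", "mac-A", "pw:172.20.0.0") == {"eth-0-1"}
```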
Configuration on the switch looks like this:
```
msw-top# configure terminal
interface eth-0-1
mpls-vpls v-ipng ethernet
interface eth-0-2
mpls-vpls v-ipng ethernet
interface eth-0-3
mpls-vpls v-ipng ethernet
interface eth-0-4
mpls-vpls v-ipng ethernet
!
mpls vpls v-ipng 829801
vpls-peer 172.20.0.0 raw
vpls-peer 172.20.0.1 raw
```
The first set of commands add each individual interface into the VPLS instance by binding it to a
name, in this case _v-ipng_. Then, the VPLS neighbors are specified, by offering a _pw-id_ (829801)
which is used to construct a _pseudowire_ to the two peers. The first, 172.20.0.0 is `msw-bottom`, and
the other, 172.20.0.1 is `msw-core`. Each switch that participates in the VPLS for _v-ipng_ will signal
LSPs to each of its peers, and MAC learning will be enabled just as if each of these _pseudowires_
were a regular switchport.
Once I configure this pattern on all three switches, effectively interfaces `eth-0-1 - 4` are now
bound together as a virtual switch with a unique broadcast domain dedicated to instance _v-ipng_.
I've created a fully transparent 12-port switch, which means that whatever traffic I generate will
be encapsulated in MPLS and sent through the MPLS network towards its destination port.
Let's take a look at the `msw-core` switch to see what this looks like:
```
msw-core# show ldp vpls
VPLS-ID Peer Address State Type Label-Sent Label-Rcvd Cw
829801 172.20.0.0 Up ethernet 32774 32773 0
829801 172.20.0.2 Up ethernet 32776 32774 0
msw-core# show mpls vpls mesh
VPLS-ID Peer Addr/name In-Label Out-Intf Out-Label Type St Evpn Type2 Sr-tunid
829801 172.20.0.0/- 32777 eth-0-50 32775 RAW Up N N -
829801 172.20.0.2/- 32778 eth-0-49 32776 RAW Up N N -
msw-core# show mpls vpls detail
Virtual Private LAN Service Instance: v-ipng, ID: 829801
Group ID: 0, Configured MTU: NULL
Description: none
AC interface :
Name TYPE Vlan
eth-0-1 Ethernet ALL
eth-0-2 Ethernet ALL
eth-0-3 Ethernet ALL
eth-0-4 Ethernet ALL
Mesh Peers :
Peer TYPE State C-Word Tunnel name LSP name
172.20.0.0 RAW UP Disable N/A N/A
172.20.0.2 RAW UP Disable N/A N/A
Vpls-mac-learning enable
Discard broadcast disabled
Discard unknown-unicast disabled
Discard unknown-multicast disabled
```
Putting this to the test, I decide to run a loadtest saturating 12x 10G of traffic through this
spiffy 12-port virtual switch. I randomly assign ports on the loadtester to the 12 ports in the
_v-ipng_ VPLS, and then I start a full line rate load with 128 byte packets. Considering I'm using
twelve TenGig ports, I would expect 12x8.45 or roughly 101Mpps flowing, and indeed, the loadtests
demonstrate this mark nicely:
{{< image src="/assets/oem-switch/vpls-trex.png" alt="VPLS T-Rex" >}}
**Important**: The screenshot above shows only the first four ports on the T-Rex interface, but there
are actually _twelve ports_ participating in this loadtest. In the top right corner, the total
throughput is correctly represented. The switches are handling 120Gbps of L1 and 103.5Gbps of L2
(expected at 128b frames, as there is a little bit of ethernet overhead for each frame): a whopping
101Mpps, which is exactly what I would expect.
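As a quick sanity check of those numbers in Python:

```python
LINE_RATE = 10e9          # bits/sec per 10G port
FRAME = 128               # L2 frame size in bytes
OVERHEAD = 20             # preamble + inter-frame gap per frame, in bytes
PORTS = 12

pps_per_port = LINE_RATE / ((FRAME + OVERHEAD) * 8)
total_pps = PORTS * pps_per_port
l2_bps = total_pps * FRAME * 8

assert round(pps_per_port / 1e6, 2) == 8.45   # ~8.45 Mpps per 10G port
assert round(total_pps / 1e6) == 101          # ~101 Mpps aggregate
assert round(l2_bps / 1e9, 1) == 103.8        # close to the 103.5Gbps T-Rex shows
```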
And the chassis doesn't even get warm.
### Conclusions
It's just super cool to see a switch like this work as expected. I did not manage to overload it at
all. In my [[previous article]({% post_url 2022-12-05-oem-switch-1 %})], I showed VxLAN, GENEVE and
NvGRE overlays at line rate. Here, I can see that MPLS with all of its Martini bells and whistles,
as well as the more advanced VPLS, are keeping up like a champ. I think at least for initial
configuration and throughput on all MPLS features I tested, both the small 24x10 + 2x100G switch
and the larger 48x10 + 2x40 + 4x100G switch are keeping up just fine.
A duration test will have to show if the configuration and switch fabric are stable _over time_, but
I am hopeful that Centec is hitting the exact sweet spot for me on the MPLS transport front.
Yes, yes, yes. I did also promise to take a look at eVPN functionality (another form of
L2VPN, which uses iBGP to share which MAC addresses live behind which VxLAN ports). This post has
been fun, but also quite long (4300 words!), so I'll follow up in a future article on the eVPN
capabilities of the Centec switches.
---
date: "2023-02-12T09:51:23Z"
title: 'Review: Compulab Fitlet2'
---
{{< image width="400px" float="right" src="/assets/fitlet2/Fitlet2-stock.png" alt="Fitlet" >}}
A while ago, in June 2021, we were discussing home routers that can keep up with 1G+ internet
connections in the [CommunityRack](https://www.communityrack.org) telegram channel. Of course
at IPng Networks we are fond of the Supermicro Xeon D1518 [[ref]({% post_url 2021-09-21-vpp-7 %})],
which has a bunch of 10Gbit X522 and 1Gbit i350 and i210 intel NICs, but it does come at a certain
price.
For smaller applications, PC Engines APU6 [[ref]({%post_url 2021-07-19-pcengines-apu6 %})] is
kind of cool and definitely more affordable. But, in this chat, Patrick offered an alternative,
the [[Fitlet2](https://fit-iot.com/web/products/fitlet2/)] which is a small, passively cooled,
and expandable IoT-esque machine.
Fast forward 18 months: Patrick decided to sell off his units, so I bought one off of him
and decided to loadtest it. Considering the price tag (the unit I will be testing ships for
around $400) and its ability to take (1G/SFP) fiber optics, it may be a pretty cool one!
# Executive Summary
**TL/DR: Definitely a cool VPP router, 3x 1Gbit line rate, A- would buy again**
With some care on the VPP configuration (notably RX/TX descriptors), this unit can handle L2XC at
(almost) line rate in both directions (2.94Mpps out of a theoretical 2.97Mpps) with one VPP worker
thread, which is not just good, it's _Good Enough&trade;_; at that point there is still plenty of
headroom on the CPU, as the Atom E3950 has 4 cores.
In IPv4 routing, using two VPP worker threads, and 2 RX/TX queues on each NIC, the machine keeps up
with 64 byte traffic in both directions (ie 2.97Mpps), again with compute power to spare, and while
using only two out of four CPU cores on the Atom E3950.
For a $400,- machine that draws close to 11 Watts fully loaded, and sports 8GB of RAM (of a maximum
of 16GB), this Fitlet2 is a gem: it will easily keep up with 3x 1Gbit in a production environment,
while carrying multiple full BGP tables (900K IPv4 and 170K IPv6), with room to spare. _It's a classy
little machine!_
## Detailed findings
{{< image width="250px" float="right" src="/assets/fitlet2/Fitlet2-BottomOpen.png" alt="Fitlet2 Open" >}}
The first thing that I noticed when it arrived is how small it is! The design of the Fitlet2 has a
motherboard with a non-removable Atom E3950 CPU running at 1.6GHz, from the _Goldmont_ series. This
is a notoriously slow/budget CPU, and it comes with 4C/4T, each CPU thread comes with 24kB of L1
and 1MB of L2 cache, and there is no L3 cache on this CPU at all. That would mean performance in
applications like VPP (which try to leverage these caches) will be poorer -- the main question on
my mind is: does the CPU have enough __oompff__ to keep up with the 1G network cards? I'll want this
CPU to be able to handle roughly 4.5Mpps in total, in order for Fitlet2 to count itself amongst the
_wirespeed_ routers.
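That 4.5Mpps target is simply three 1Gbit ports at 64-byte line rate, rounded up a little:

```python
# Line rate on 1Gbit with minimum-size frames: each 64 byte frame costs
# an extra 20 bytes of preamble and inter-frame gap on the wire.
pps_1g_64b = 1e9 / ((64 + 20) * 8)
total = 3 * pps_1g_64b               # three gigabit ports, both ways covered per-port

assert round(pps_1g_64b / 1e6, 2) == 1.49
assert round(total / 1e6, 1) == 4.5  # hence the ~4.5Mpps target
```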
Looking further, Fitlet2 has one HDMI and one MiniDP port, two USB2 and two USB3 ports, two Intel
i211 NICs with RJ45 port (these are 1Gbit). There's a helpful MicroSD slot, two LEDs and an audio
in- and output 3.5mm jack. The power button does worry me a little bit: I feel like just brushing
against it may turn the machine off. I do appreciate the cooling situation - the top finned plate
mates with the CPU on the top of the motherboard, and the bottom bracket holds a sizable aluminium
cooling block which further helps dissipate heat, without needing any active cooling. The Fitlet
folks claim this machine can run in environments anywhere between -50C and +112C, which I won't be
doing :)
{{< image width="400px" float="right" src="/assets/fitlet2/Fitlet2+FACET.png" alt="Fitlet2" >}}
Inside, there's a single DDR3 SODIMM slot for memory (the one I have came with 8GB at 1600MT/s) and
a custom, albeit open specification, expansion board called a __FACET-Card__ which stands for
**F**unction **A**nd **C**onnectivity **E**xtension **T**-Card, well okay then! The __FACET__ card
in this little machine sports one extra Intel i210-IS NIC, an M2 for an SSD, and an M2E for a WiFi
port. The NIC is a 1Gbit SFP capable device. You can see its optic cage on the _FACET_ card above,
next to the yellow CMOS / Clock battery.
The whole thing is fed with 12V powerbrick delivering 2A, and a nice touch is that the barrel
connector has a plastic bracket that locks it into the chassis by turning it 90degrees, so it won't
flap around in the breeze and detach. I wish other embedded PCs would ship with those, as I've been
fumbling around in 19" racks that are, let me say, less tightly cable organized, and may or may not
have disconnected the CHIX routeserver at some point in the past. Sorry, Max :)
For the curious, here's a list of interesting details: [[lspci](/assets/fitlet2/lspci.txt)] -
[[dmidecode](/assets/fitlet2/dmidecode.txt)] -
[[likwid-topology](/assets/fitlet2/likwid-topology.txt)] - [[dmesg](/assets/fitlet2/dmesg.txt)].
## Preparing the Fitlet2
First, I grab a USB key and install Debian _Bullseye_ (11.5) on it, using the UEFI installer. After
booting, I carry through the instructions on my [[VPP Production]({% post_url 2021-09-21-vpp-7 %})]
post. Notably, I create the `dataplane` namespace, run an SSH and SNMP agent there, and boot with
`isolcpus=1-3` so that I can give three worker threads to VPP. I start off giving it only one (1)
worker thread, because this way I can take a look at what the performance is of a single CPU, before
scaling out to the three (3) threads that this CPU can offer. I also take the defaults for DPDK,
notably allowing the DPDK poll-mode-drivers to take their proposed defaults:
* **GigabitEthernet1/0/0**: Intel Corporation I211 Gigabit Network Connection (rev 03)
> rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8) <br />
> tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8)
* **GigabitEthernet3/0/0**: Intel Corporation I210 Gigabit Fiber Network Connection (rev 03)
> rx: queues 1 (max 4), desc 512 (min 32 max 4096 align 8) <br />
> tx: queues 2 (max 4), desc 512 (min 32 max 4096 align 8)
I observe that the i211 NIC allows for a maximum of two (2) RX/TX queues, while the (older!) i210
allows for four (4) of them. Another thing that I see here is that there are two (2) TX
queues active, while I only have one worker thread -- so what gives? This is because in addition to
the workers there is always a main thread, which may itself need to send traffic out on an
interface, so it also attaches to a TX queue alongside the worker thread(s).
When exploring new hardware, I find it useful to take a look at the output of a few tactical `show`
commands on the CLI, such as:
**1. What CPU is in this machine?**
```
vpp# show cpu
Model name: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz
Microarch model (family): [0x6] Goldmont ([0x5c] Apollo Lake) stepping 0x9
Flags: sse3 pclmulqdq ssse3 sse41 sse42 rdrand pqe rdseed aes sha invariant_tsc
Base frequency: 1.59 GHz
```
**2. Which devices on the PCI bus, PCIe speed details, and driver?**
```
vpp# show pci
Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
0000:01:00.0 0 8086:1539 2.5 GT/s x1 uio_pci_generic
0000:02:00.0 0 8086:1539 2.5 GT/s x1 igb
0000:03:00.0 0 8086:1536 2.5 GT/s x1 uio_pci_generic
```
__Note__: This device at slot `02:00.0` is the second onboard RJ45 i211 NIC. I have used this one
to log in to the Fitlet2 and more easily kill/restart VPP and so on, but I could of course just as
well give it to VPP, in which case I'd have three gigabit interfaces to play with!
**3. What details are known for the physical NICs?**
```
vpp# show hardware GigabitEthernet1/0/0
GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0
Link speed: 1 Gbps
RX Queues:
queue thread mode
0 vpp_wk_0 (1) polling
TX Queues:
TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]
queue shared thread(s)
0 no 0
1 no 1
Ethernet address 00:01:c0:2a:eb:a8
Intel e1000
carrier up full duplex max-frame-size 2048
flags: admin-up maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8)
tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8)
pci: device 8086:1539 subsystem 8086:0000 address 0000:01:00.00 numa 0
max rx packet len: 16383
promiscuous: unicast off all-multicast on
vlan offload: strip off filter off qinq off
rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
vlan-extend scatter keep-crc rss-hash
rx offload active: ipv4-cksum scatter
tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
tcp-tso multi-segs
tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
ipv6-udp ipv6-ex ipv6
rss active: none
tx burst function: (not available)
rx burst function: (not available)
```
### Configuring VPP
After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({% post_url
2022-03-27-vppcfg-1 %})] and its [[planner]({% post_url 2022-04-02-vppcfg-2 %})]. A complete
configuration guide on vppcfg can be found
[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb
pim@fitlet:~$ sudo apt install python3-pip
pim@fitlet:~$ sudo pip install vppcfg-0.0.3-py3-none-any.whl
```
### Methodology
#### Method 1: Single CPU Thread Saturation
First I will take VPP out for a spin by creating an L2 Cross Connect where any ethernet frame
received on `Gi1/0/0` will be directly transmitted as-is on `Gi3/0/0` and vice versa. This is a
relatively cheap operation for VPP, as it will not have to do any routing table lookups. The
configuration looks like this:
```
pim@fitlet:~$ cat << EOF > l2xc.yaml
interfaces:
GigabitEthernet1/0/0:
mtu: 1500
l2xc: GigabitEthernet3/0/0
GigabitEthernet3/0/0:
mtu: 1500
l2xc: GigabitEthernet1/0/0
EOF
pim@fitlet:~$ vppcfg plan -c l2xc.yaml
[INFO ] root.main: Loading configfile l2xc.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134
comment { vppcfg sync: 10 CLI statement(s) follow }
set interface l2 xconnect GigabitEthernet1/0/0 GigabitEthernet3/0/0
set interface l2 tag-rewrite GigabitEthernet1/0/0 disable
set interface l2 xconnect GigabitEthernet3/0/0 GigabitEthernet1/0/0
set interface l2 tag-rewrite GigabitEthernet3/0/0 disable
set interface mtu 1500 GigabitEthernet1/0/0
set interface mtu 1500 GigabitEthernet3/0/0
set interface mtu packet 1500 GigabitEthernet1/0/0
set interface mtu packet 1500 GigabitEthernet3/0/0
set interface state GigabitEthernet1/0/0 up
set interface state GigabitEthernet3/0/0 up
[INFO ] vppcfg.reconciler.write: Wrote 11 lines to (stdout)
[INFO ] root.main: Planning succeeded
```
{{< image width="500px" float="right" src="/assets/fitlet2/l2xc-demo1.png" alt="Fitlet2 L2XC First Try" >}}
After I paste these commands on the CLI, I start T-Rex in L2 stateless mode and generate some
activity by starting the `bench` profile on port 0 with packets of 64 bytes in size
and with varying IPv4 source and destination addresses _and_ ports:
```
tui>start -f stl/bench.py -m 1.48mpps -p 0 \
        -t size=64,vm=var2
```
Let me explain a few highlights from the picture to the right. When starting this profile, I
specified 1.48Mpps, which is the maximum number of packets/second that can be generated on a 1Gbit
link when using 64 byte frames (the smallest permissible ethernet frames). I do this because the
loadtester comes with 10Gbit (and 100Gbit) ports, but the Fitlet2 has only 1Gbit ports. Then, I see
that port 0 is indeed transmitting (**Tx pps**) 1.48 Mpps, shown in dark blue. This is about 992 Mbps
on the wire (the **Tx bps L1**), but each 64 byte ethernet frame carries an additional 20 bytes of
overhead [[details](https://en.wikipedia.org/wiki/Ethernet_frame)], so the **Tx
bps L2** is about `64/84 * 992.35 = 756.08` Mbps, which lines up.
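The same arithmetic in a couple of lines of Python, for the skeptical reader:

```python
FRAME = 64             # smallest ethernet frame, in bytes
GAP = 20               # preamble (8) + inter-frame gap (12), in bytes
TX_BPS_L1 = 992.35e6   # what T-Rex reports on the wire

# Maximum packet rate on a 1Gbit link with minimum-size frames:
max_pps = 1e9 / ((FRAME + GAP) * 8)
assert round(max_pps / 1e6, 2) == 1.49        # the 1.48Mpps I asked for

# The L2 rate is the L1 rate minus the per-frame overhead:
tx_bps_l2 = FRAME / (FRAME + GAP) * TX_BPS_L1
assert round(tx_bps_l2 / 1e6, 2) == 756.08
```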
Then, after the Fitlet2 tries its best to forward those from its receiving Gi1/0/0 port onto its
transmitting port Gi3/0/0, they are received again by T-Rex on port 1. Here, I can see that the **Rx
pps** is 1.29 Mpps, with an **Rx bps** of 660.49 Mbps (which is the L2 counter), and in bright red
at the top I see the **drop_rate** is about 95.59 Mbps. In other words, the Fitlet2 is _not keeping
up_.
But, after I take a look at the runtime statistics, I see that the CPU isn't very busy at all:
```
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 23.8, 10 sec internal node vector rate 4.30 loops/sec 1638976.68
vector rates in 1.2908e6, out 1.2908e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet3/0/0-output active 6323688 27119700 0 9.14e1 4.29
GigabitEthernet3/0/0-tx active 6323688 27119700 0 1.79e2 4.29
dpdk-input polling 44406936 27119701 0 5.35e2 .61
ethernet-input active 6323689 27119701 0 1.42e2 4.29
l2-input active 6323689 27119701 0 9.94e1 4.29
l2-output active 6323689 27119701 0 9.77e1 4.29
```
Very interesting! Notice that the line above says `vector rates in .. out ..`: the thread is
receiving only 1.29Mpps, and it is managing to send all of them out as well. When a VPP
worker is busy, each DPDK call will yield many packets, up to 256 in one call, which means the
number of "vectors per call" will rise. Here, I see that DPDK is returning an average of
only 0.61 packets each time it polls the NIC, and each time a bunch of packets is sent off
into the VPP graph, there are on average 4.29 packets per loop. If the CPU were the bottleneck, it
would look more like 256 in the Vectors/Call column -- so the **bottleneck must be in the NIC**.
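For reference, that Vectors/Call column is just total vectors divided by total calls, and it makes
for a handy load gauge:

```python
# From the 'show run' output above: dpdk-input was called 44406936 times
# and returned 27119701 vectors (packets) in total.
calls, vectors = 44_406_936, 27_119_701
vectors_per_call = vectors / calls
assert round(vectors_per_call, 2) == 0.61

# A congested worker would instead approach VPP's batch limit of 256
# packets per call; 0.61 means the thread is mostly polling an idle NIC.
VLIB_FRAME_SIZE = 256
assert vectors_per_call < VLIB_FRAME_SIZE
```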
Remember above, when I showed the `show hardware` command output? There's a clue in there. The
Fitlet2 has two onboard i211 NICs and one i210 NIC on the _FACET_ card. Despite the lower number,
the i210 is a bit more advanced
[[datasheet](/assets/fitlet2/i210_ethernet_controller_datasheet-257785.pdf)]. If I reverse the
direction of flow (so receiving on the i210 Gi3/0/0, and transmitting on the i211 Gi1/0/0), things
look a fair bit better:
```
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 12.6, 10 sec internal node vector rate 4.02 loops/sec 853956.73
vector rates in 1.4799e6, out 1.4799e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet1/0/0-output active 4642964 18652932 0 9.34e1 4.02
GigabitEthernet1/0/0-tx active 4642964 18652420 0 1.73e2 4.02
dpdk-input polling 12200880 18652933 0 3.27e2 1.53
ethernet-input active 4642965 18652933 0 1.54e2 4.02
l2-input active 4642964 18652933 0 1.04e2 4.02
l2-output active 4642964 18652933 0 1.01e2 4.02
```
Hey, would you look at that! The line up top shows an inbound vector rate of 1.4799e6 (which is
1.48Mpps), and outbound is the same number. In this configuration as well, the DPDK node isn't
even reading that many packets, and the graph traversal averages 4.02 packets per run,
which means that this CPU can do in excess of 1.48Mpps on one (1) CPU thread. Slick!
So what _is_ the maximum throughput per CPU thread? To show this, I will saturate both ports with
line rate traffic, and see what makes it through the other side. After instructing the T-Rex to
perform the following profile:
```
tui>start -f stl/bench.py -m 1.48mpps -p 0 1 \
-t size=64,vm=var2
```
T-Rex will faithfully start to send traffic on both ports and expect the same amount back from the
Fitlet2 (the _Device Under Test_ or _DUT_). I can see that from T-Rex port 1->0 all traffic makes
its way back, but from port 0->1 there is a little bit of loss (for the 1.48Mpps sent, only 1.43Mpps
is returned). This is the same phenomenon that I explained above -- the i211 NIC is not quite as
good at eating packets as the i210 NIC is.
Even when doing this though, the (still) single threaded VPP is keeping up just fine, CPU wise:
```
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 13.4, 10 sec internal node vector rate 13.59 loops/sec 122820.33
vector rates in 2.9599e6, out 2.8834e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet1/0/0-output active 1822674 19826616 0 3.69e1 10.88
GigabitEthernet1/0/0-tx active 1822674 19597360 0 1.51e2 10.75
GigabitEthernet3/0/0-output active 1823770 19826612 0 4.79e1 10.87
GigabitEthernet3/0/0-tx active 1823770 19029508 0 1.56e2 10.43
dpdk-input polling 1827320 39653228 0 1.62e2 21.70
ethernet-input active 3646444 39653228 0 7.67e1 10.87
l2-input active 1825356 39653228 0 4.96e1 21.72
l2-output active 1825356 39653228 0 4.58e1 21.72
```
Here we can see 2.96Mpps received (_vector rates in_) while only 2.88Mpps are transmitted (_vector
rates out_). First off, this lines up perfectly with the reporting of T-Rex in the screenshot above,
and it also shows that one direction loses more packets than the other. We're dropping some 80kpps,
but where did they go? Looking at the statistics counters, which include any packets which had
errors in processing, we learn more:
```
vpp# show err
Count Node Reason Severity
3109141488 l2-output L2 output packets error
3109141488 l2-input L2 input packets error
9936649 GigabitEthernet1/0/0-tx Tx packet drops (dpdk tx failure) error
32120469 GigabitEthernet3/0/0-tx Tx packet drops (dpdk tx failure) error
```
{{< image width="500px" float="right" src="/assets/fitlet2/l2xc-demo2.png" alt="Fitlet2 L2XC Second Try" >}}
Aha! From previous experience I know that when DPDK signals packet drops due to 'tx failure',
this is often because it's trying to hand off the packet to the NIC, which has a ringbuffer to
collect packets while the hardware transmits them onto the wire, and this NIC has run out of slots,
which means the packet has to be dropped and a kitten gets hurt. But, I can raise the number of
RX and TX slots by setting them in VPP's `startup.conf` file:
```
dpdk {
dev default {
num-rx-desc 512 ## default
num-tx-desc 1024
}
no-multi-seg
}
```
And with that simple tweak, I've succeeded in configuring the Fitlet2 in a way that it is capable of
receiving and transmitting 64 byte packets in both directions at (almost) line rate, with **one CPU
thread**.
#### Method 2: Rampup using trex-loadtest.py
For this test, I decide to put the Fitlet2 into L3 mode (up until now it was set up in _L2 Cross
Connect_ mode). To do this, I give the interfaces an IPv4 address and set a route for the loadtest
traffic (which will be coming from `16.0.0.0/8` and going to `48.0.0.0/8`). I will once again look
to `vppcfg` to do this, because manipulating the YAML files like this allows me to easily and reliably
swap back and forth, letting `vppcfg` do the mundane chore of figuring out what commands to type, in
which order, safely.
From my existing L2XC dataplane configuration, I switch to L3 like so:
```
pim@fitlet:~$ cat << EOF > l3.yaml
interfaces:
GigabitEthernet1/0/0:
mtu: 1500
lcp: e1-0-0
addresses: [ 100.64.10.1/30 ]
GigabitEthernet3/0/0:
mtu: 1500
lcp: e3-0-0
addresses: [ 100.64.10.5/30 ]
EOF
pim@fitlet:~$ vppcfg plan -c l3.yaml
[INFO ] root.main: Loading configfile l3.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134
comment { vppcfg prune: 2 CLI statement(s) follow }
set interface l3 GigabitEthernet1/0/0
set interface l3 GigabitEthernet3/0/0
comment { vppcfg create: 2 CLI statement(s) follow }
lcp create GigabitEthernet1/0/0 host-if e1-0-0
lcp create GigabitEthernet3/0/0 host-if e3-0-0
comment { vppcfg sync: 2 CLI statement(s) follow }
set interface ip address GigabitEthernet1/0/0 100.64.10.1/30
set interface ip address GigabitEthernet3/0/0 100.64.10.5/30
[INFO ] vppcfg.reconciler.write: Wrote 9 lines to (stdout)
[INFO ] root.main: Planning succeeded
```
One small note -- `vppcfg` cannot set routes, and this is by design as the Linux Control Plane is
meant to take care of that. I can either set routes using `ip` in the `dataplane` network namespace,
like so:
```
pim@fitlet:~$ sudo nsenter --net=/var/run/netns/dataplane
root@fitlet:/home/pim# ip route add 16.0.0.0/8 via 100.64.10.2
root@fitlet:/home/pim# ip route add 48.0.0.0/8 via 100.64.10.6
```
Or, alternatively, I can set them directly on VPP in the CLI, interestingly with identical syntax:
```
pim@fitlet:~$ vppctl
vpp# ip route add 16.0.0.0/8 via 100.64.10.2
vpp# ip route add 48.0.0.0/8 via 100.64.10.6
```
The loadtester will run a bunch of profiles (1514b, _imix_, 64b with multiple flows, and 64b with
only one flow), either in unidirectional or bidirectional mode, which gives me a wealth of data to
share:
Loadtest | 1514b | imix | Multi 64b | Single 64b
-------------------- | -------- | -------- | --------- | ----------
***Bidirectional*** | [81.7k (100%)](/assets/fitlet2/fitlet2.bench-var2-1514b-bidirectional.html) | [327k (100%)](/assets/fitlet2/fitlet2.bench-var2-imix-bidirectional.html) | [1.48M (100%)](/assets/fitlet2/fitlet2.bench-var2-bidirectional.html) | [1.43M (98.8%)](/assets/fitlet2/fitlet2.bench-bidirectional.html)
***Unidirectional*** | [73.2k (89.6%)](/assets/fitlet2/fitlet2.bench-var2-1514b-unidirectional.html) | [255k (78.2%)](/assets/fitlet2/fitlet2.bench-var2-imix-unidirectional.html) | [1.18M (79.4%)](/assets/fitlet2/fitlet2.bench-var2-unidirectional.html) | [1.23M (82.7%)](/assets/fitlet2/fitlet2.bench-unidirectional.html)
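As a reality check on the 100% figures in this table: they correspond to theoretical Gigabit line rate for each frame size, since every Ethernet frame carries an extra 20 bytes on the wire (7B preamble, 1B start-of-frame delimiter, 12B inter-frame gap). A small sketch -- note that the 7:4:1 'simple imix' frame mix is my assumption, not something taken from the T-Rex profile itself:

```python
# Theoretical line rate in packets/sec on 1 Gbit/s Ethernet.
# Each frame costs an extra 20 bytes on the wire (preamble + SFD + inter-frame gap).
LINE_BPS = 1_000_000_000
WIRE_OVERHEAD = 20

def line_rate_pps(frame_size: float) -> float:
    return LINE_BPS / ((frame_size + WIRE_OVERHEAD) * 8)

# Assumed 'simple imix' mix: 7x 64B, 4x 594B, 1x 1518B frames
imix_avg = (7 * 64 + 4 * 594 + 1 * 1518) / 12

print(f"64b  : {line_rate_pps(64):>9.0f} pps")        # ~1.488 Mpps, the 1.48M above
print(f"imix : {line_rate_pps(imix_avg):>9.0f} pps")  # ~327 kpps, the 327k above
print(f"1514b: {line_rate_pps(1514):>9.0f} pps")      # ~81.5 kpps, close to the 81.7k above
```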
## Caveats
While all results of the loadtests are navigable [[here](/assets/fitlet2/fitlet2.html)], I will cherrypick
one interesting bundle showing the results of _all_ (bi- and unidirectional) tests:
{{< image src="/assets/fitlet2/loadtest.png" alt="Fitlet2 All Loadtests" >}}
I have to admit I was a bit stumped by the unidirectional loadtests - these
are pushing traffic into the i211 (onboard RJ45) NIC, and out of the i210
(_FACET_ SFP) NIC. What I found super weird (and can't really explain) is
that the _unidirectional_ load, which in the end serves half the packets/sec,
is __lower__ than the _bidirectional_ load, which was almost perfect, dropping
only a little bit of traffic at the very end. A picture says a thousand words -
so here's a graph of all the loadtests, which you can also find by clicking on
the links in the table.
## Appendix
### Generating the data
The JSON files that are emitted by my loadtester script can be fed directly into Michal's
[visualizer](https://github.com/wejn/trex-loadtest-viz) to plot interactive graphs (which I've
done for the table above):
```
DEVICE=Fitlet2
## Loadtest
SERVER=${SERVER:=hvn0.lab.ipng.ch}
TARGET=${TARGET:=l3}
RATE=${RATE:=10} ## % of line
DURATION=${DURATION:=600}
OFFSET=${OFFSET:=10}
PROFILE=${PROFILE:="ipng"}
for DIR in unidirectional bidirectional; do
for SIZE in 1514 imix 64; do
FLAGS=""
[ "$DIR" = "unidirectional" ] && FLAGS="-u"
## Multiple Flows
./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},vm=var2,size=${SIZE}" \
-rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-var2-${SIZE}-${DIR}.json
[ "$SIZE" = "64" ] && {
## Specialcase: Single Flow
./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},size=${SIZE}" \
-rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-${SIZE}-${DIR}.json
}
done
done
## Graphs
ruby graph.rb -t "${DEVICE} All Loadtests" ${DEVICE}*.json -o ${DEVICE}.html
ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*unidir*.json \
-o ${DEVICE}.unidirectional.html
ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*bidir*.json \
-o ${DEVICE}.bidirectional.html
for i in ${PROFILE}-var2-1514 ${PROFILE}-var2-imix ${PROFILE}-var2-64 ${PROFILE}-64; do
ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*-${i}*unidirectional.json \
-o ${DEVICE}.$i-unidirectional.html
ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*-${i}*bidirectional.json \
-o ${DEVICE}.$i-bidirectional.html
done
```
---
date: "2023-02-24T07:31:00Z"
title: 'Case Study: VPP at Coloclue, part 2'
---
{{< image width="300px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Yoloclue" >}}
* Author: Pim van Pelt, Rogier Krieger
* Reviewers: Coloclue Network Committee
* Status: Draft - Review - **Published**
Almost precisely two years ago, in February of 2021, I created a loadtesting environment at
[[Coloclue](https://coloclue.net)] to prove that a provider of L2 connectivity between two
datacenters in Amsterdam was not incurring jitter or loss on its services -- I wrote up my findings
in [[an article]({% post_url 2021-02-27-coloclue-loadtest %})], which demonstrated that the service
provider indeed provides a perfect service. One month later, in March 2021, I briefly ran
[[VPP](https://fd.io)] on one of the routers at Coloclue, but due to lack of time and a few
technical hurdles along the way, I had to roll back [[ref]({% post_url 2021-03-27-coloclue-vpp %})].
## The Problem
Over the years, Coloclue AS8283 has continued to suffer from packet loss in its network. Taking a look
at a simple traceroute, in this case from IPng AS8298, shows very high variance and _packetlo_
when entering the network (at hop 5 in a router called `eunetworks-2.router.nl.coloclue.net`):
```
My traceroute [v0.94]
squanchy.ipng.ch (194.1.193.90) -> 185.52.227.1 2023-02-24T09:03:36+0100
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. chbtl0.ipng.ch 0.0% 49904 1.3 0.9 0.7 1.7 0.2
2. chrma0.ipng.ch 0.0% 49904 1.7 1.2 1.2 2.1 0.9
3. defra0.ipng.ch 0.0% 49904 6.3 6.2 6.0 19.2 1.3
4. nlams0.ipng.ch 0.0% 49904 12.7 12.6 12.4 19.8 1.8
5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.2% 49903 98.8 12.3 12.0 272.8 23.0
6. 185.52.227.1 6.6% 49903 15.3 12.5 12.3 308.7 20.4
```
The last two hops show packet loss well north of 6.5%; some paths are better, some are worse,
but notably, when more than one router is in the path, it's difficult to pinpoint which one is
responsible. But honestly, any source will reveal packet loss and high variance when traversing
one or more Coloclue routers, to a greater or lesser degree:
From AS8283 Coloclue (Amsterdam) | From AS8298 IPng (Br&uuml;ttisellen)
--------------- | ---------------------
![Before nlede01](/assets/coloclue-vpp/coloclue-beacon-before-nlams01.png) | ![Before chbtl01](/assets/coloclue-vpp/coloclue-beacon-before-chbtl01.png)
_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
Netherlands), and from (right) a machine at AS8298 IPng (in Br&uuml;ttisellen, Switzerland), both
are showing ~4.8-5.0% packetlo and high variance in end to end latency. No bueno!_
## Isolating a Device Under Test
Because Coloclue has several routers, I want to ensure that traffic traverses only the _one router_ under
test. I decide to use an allocated but currently unused IPv4 prefix and announce that only from one
of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a
piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2
scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to
add a small feature to get what I need: **beacons**.
### Setting up the beacon
A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a
particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and
uses a construction similar to the existing feature for "supernets". It determines if a given prefix
must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will
then be matched in `is_coloclue_beacon()` and announced. For the curious, [[this
commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6)] holds the
logic and tests to ensure this is safe.
Based on a per-router config (eg. `vars/eunetworks-2.router.nl.coloclue.net.yml`) I can now add the
following YAML stanza:
```
coloclue:
beacons:
- prefix: "185.52.227.0"
length: 24
comment: "VPP test prefix (pim, rogier)"
```
And further, from this router, I can forward all traffic destined to this /24 to a machine running
in EUNetworks (my Dell R630 called `hvn0.nlams2.ipng.ch`), using a simple static route:
```
statics:
...
- route: "185.52.227.0/24"
via: "94.142.240.71"
comment: "VPP test prefix (pim, rogier)"
```
After running Kees, I can now see traffic for that /24 show up on my machine. The last step is to
ensure that traffic that is destined for the beacon will always traverse back over `eunetworks-2`.
Coloclue uses VRRP, and sometimes another router might be the active gateway. With a little trick on
my machine, I can force the traffic path by means of _policy based routing_:
```
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.254
pim@hvn0-nlams2:~$ sudo ip ro add prohibit 185.52.227.0/24
pim@hvn0-nlams2:~$ sudo ip addr add 185.52.227.1/32 dev lo
pim@hvn0-nlams2:~$ sudo ip rule add from 185.52.227.0/24 lookup 10
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.253 table 10
```
First, I set the default gateway to be the VRRP address that floats between multiple routers. Then,
I will set a `prohibit` route for the covering /24, which means the machine will send an ICMP
unreachable (rather than discarding the packets), which can be useful later. Next, I'll add .1 as an
IPv4 address onto loopback, after which the machine will start replying to ICMP packets there with
icmp-echo rather than dst-unreach. To make sure routing is always symmetric, I'll add an `ip rule`
which is a classifier that matches packets based on their source address, and then diverts these to
an alternate routing table, which has only one entry: send via .253 (which is `eunetworks-2`).
Let me show this in action:
```
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.254
eunetworks-gateway-100.router.nl.coloclue.net.
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.253
bond0-100.eunetworks-2.router.nl.coloclue.net.
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.252
bond0-100.eunetworks-3.router.nl.coloclue.net.
pim@hvn0-nlams2:~$ ip -4 nei | grep '94.142.240.25[234]'
94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE
94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
```
In the output above, I can see that `eunetworks-2` (94.142.240.253) has MAC address
`64:9d:99:b1:31:af`, and that `eunetworks-3` (94.142.240.252) has MAC address `64:9d:99:b1:31:db`.
My default gateway, handled by VRRP, is at .254 and it's using the second MAC address, so I know that
`eunetworks-3` is primary, and will handle my egress traffic.
### Verifying symmetric routing of the beacon
As a quick demonstration of the symmetric routing case, I can tcpdump and see that my "usual"
egress traffic will be sent to the MAC address of the VRRP primary (which I showed to be
`eunetworks-3` above), while traffic coming from 185.52.227.0/24 ought to be sent to the
MAC address of `eunetworks-2` due to the `ip rule` and alternate routing table 10:
```
pim@hvn0-nlams2:~$ sudo tcpdump -eni coloclue host 194.1.163.93 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on coloclue, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:02:17.193844 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
194.1.163.93 > 94.142.240.71: ICMP echo request, id 16287, seq 1, length 64
10:02:17.193882 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:db, ethertype IPv4 (0x0800), length 98:
94.142.240.71 > 194.1.163.93: ICMP echo reply, id 16287, seq 1, length 64
10:02:19.276657 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
194.1.163.93 > 185.52.227.1: ICMP echo request, id 6646, seq 1, length 64
10:02:19.276694 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:af, ethertype IPv4 (0x0800), length 98:
185.52.227.1 > 194.1.163.93: ICMP echo reply, id 6646, seq 1, length 64
```
It takes a keen eye to spot the difference here: the first packet (which is going to the main
IPv4 address 94.142.240.71), is returned via MAC address `64:9d:99:b1:31:db` (the VRRP default
gateway), but the second one (going to the beacon 185.52.227.1) is returned via MAC address
`64:9d:99:b1:31:af`.
I've now ensured that traffic to and from 185.52.227.1 will always traverse through the DUT
(`eunetworks-2` with MAC `64:9d:99:b1:31:af`). Very elegant :-)
## Installing VPP
I've written about this before; the general _spiel_ is just to follow my previous article (I'm
often very glad to read back my own articles as they serve as pretty good documentation to my
forgetful chipmunk-sized brain!), so here, I'll only recap what's already written in
[[vpp-7]({% post_url 2021-09-21-vpp-7 %})]:
1. Build VPP with Linux Control Plane
1. Bring `eunetworks-2` into maintenance mode, so we can safely tinker with it
1. Start services like ssh, snmp, keepalived and bird in a new `dataplane` namespace
1. Start VPP and give the LCP interface names the same as their original
1. Slowly introduce the router: OSPF, OSPFv3, iBGP, members-bgp, eBGP, in that order
1. Re-enable keepalived and let the machine forward traffic
1. Stare at the latency graphs
{{< image width="500px" float="right" src="/assets/coloclue-vpp/likwid-topology.png" alt="Likwid Topology" >}}
**1. BUILD:** For the first step, the build is straightforward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
CPU cores which share a hyperthread with 6-11 respectively.
I also see that L3 cache is shared across all of the cores+hyperthreads, which is normal. I decide
to give CPUs 0,1 and their hyperthread 6,7 to Linux for general purpose scheduling, and I want to
block the remaining CPUs and their hyperthreads to dedicated to VPP. So the kernel is rebooted with
`isolcpus=2-5,8-11`.
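For illustration -- and assuming a Debian-style GRUB setup, which may differ from the router's actual boot configuration -- pinning those CPUs away from the scheduler looks something like this:

```
## /etc/default/grub (sketch; exact file and variable may differ per system)
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=2-5,8-11"
## Apply and reboot:
##   update-grub && reboot
## Afterwards, /sys/devices/system/cpu/isolated should read: 2-5,8-11
```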
**2. DRAIN:** In the meantime, Rogier prepares the drain, which is a two-step process. First he marks all
the BGP sessions as `graceful_shutdown: True`, and waits for the traffic to die down. Then, he marks
the machine as `maintenance_mode: True` which will make Kees set OSPF cost to 65535 and avoid
attracting or sending traffic through this machine. After he submits these, we are free to tinker
with the router, as it will not affect any Coloclue members. Rogier also ensures we will retain
access to this little machine in Amsterdam, by preparing an IPMI serial-over-lan connection and KVM.
**3. PREPARE:** Starting sshd and snmpd in the `dataplane` namespace is the most important part. This
way, we will be able to scrape the machine using SNMP just as if it were a Linux native router, and
of course we will want to be able to log in to it. I start with these two services; the only small
note is that, because I want to run two copies (one in the default namespace and one additional
one in the `dataplane` namespace), I'll want to tweak the startup flags (pid file, config file, etc) a
little bit:
```
## in snmpd-dataplane.service
ExecStart=/sbin/ip netns exec dataplane /usr/sbin/snmpd -LOw -u Debian-snmp \
-g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid \
-C -c /etc/snmp/snmpd-dataplane.conf
## in ssh-dataplane.service
ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd \
-oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS
```
**4. LAUNCH:** Now what's left for us to do is switch from our SSH session to an IPMI serial-over-lan session
so that we can safely transition to the VPP world. Rogier and I log in and share a tmux session,
after which I bring down all ethernet links, remove VLAN sub-interfaces and the LACP BondEthernet,
leaving only the main physical interfaces. I then set link down on them, and restart VPP -- which
will take all DPDK eligible interfaces that are link admin-down, and then let the magic happen:
```
root@eunetworks-2:~# vppctl show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet5/0/0 5 down 9000/0/0/0
GigabitEthernet6/0/0 6 down 9000/0/0/0
TenGigabitEthernet1/0/0 1 down 9000/0/0/0
TenGigabitEthernet1/0/1 2 down 9000/0/0/0
TenGigabitEthernet1/0/2 3 down 9000/0/0/0
TenGigabitEthernet1/0/3 4 down 9000/0/0/0
```
Dope! One way to trick the rest of the machine into thinking it hasn't changed is to recreate these
interfaces in the `dataplane` network namespace using their _original_ interface names (eg.
`enp1s0f3` for AMS-IX, and `bond0` for the LACP-signaled BondEthernet that we'll create). Rogier
prepared an excellent `vppcfg` config file:
```
loopbacks:
loop0:
description: 'eunetworks-2.router.nl.coloclue.net'
lcp: 'loop0'
mtu: 9216
addresses: [ 94.142.247.3/32, 2a02:898:0:300::3/128 ]
bondethernets:
BondEthernet0:
description: 'Core: MLAG member switches'
interfaces: [ TenGigabitEthernet1/0/0, TenGigabitEthernet1/0/1 ]
mode: 'lacp'
load-balance: 'l34'
mac: '64:9d:99:b1:31:af'
interfaces:
GigabitEthernet5/0/0:
description: "igb 0000:05:00.0 eno1 # FiberRing"
lcp: 'eno1'
mtu: 9216
sub-interfaces:
205:
description: "Peering: Arelion"
lcp: 'eno1.205'
addresses: [ 62.115.144.33/31, 2001:2000:3080:ebc::2/126 ]
mtu: 1500
992:
description: "Transit: FiberRing"
lcp: 'eno1.992'
addresses: [ 87.255.32.130/30, 2a00:ec8::102/126 ]
mtu: 1500
GigabitEthernet6/0/0:
description: "igb 0000:06:00.0 eno2 # Free"
lcp: 'eno2'
mtu: 9216
state: down
TenGigabitEthernet1/0/0:
description: "i40e 0000:01:00.0 enp1s0f0 (bond-member)"
mtu: 9216
TenGigabitEthernet1/0/1:
description: "i40e 0000:01:00.1 enp1s0f1 (bond-member)"
mtu: 9216
TenGigabitEthernet1/0/2:
description: 'Core: link between eunetworks-2 and eunetworks-3'
lcp: 'enp1s0f2'
addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127 ]
mtu: 9214
TenGigabitEthernet1/0/3:
description: "i40e 0000:01:00.3 enp1s0f3 # AMS-IX"
lcp: 'enp1s0f3'
mtu: 9216
sub-interfaces:
501:
description: "Peering: AMS-IX"
lcp: 'enp1s0f3.501'
addresses: [ 80.249.211.161/21, 2001:7f8:1::a500:8283:1/64 ]
mtu: 1500
511:
description: "Peering: NBIP-NaWas via AMS-IX"
lcp: 'enp1s0f3.511'
addresses: [ 194.62.128.38/24, 2001:67c:608::f200:8283:1/64 ]
mtu: 1500
BondEthernet0:
lcp: 'bond0'
mtu: 9216
sub-interfaces:
100:
description: "Cust: Members"
lcp: 'bond0.100'
mtu: 1500
addresses: [ 94.142.240.253/24, 2a02:898:0:20::e2/64 ]
101:
description: "Core: Powerbars"
lcp: 'bond0.101'
mtu: 1500
addresses: [ 172.28.3.253/24 ]
105:
description: "Cust: Members (no strict uRPF filtering)"
lcp: 'bond0.105'
mtu: 1500
addresses: [ 185.52.225.14/28, 2a02:898:0:21::e2/64 ]
130:
description: "Core: Link between eunetworks-2 and dcg-1"
lcp: 'bond0.130'
mtu: 1500
addresses: [ 94.142.247.242/31, 2a02:898:0:301::14/127 ]
2502:
description: "Transit: Fusix Networks"
lcp: 'bond0.2502'
mtu: 1500
addresses: [ 37.139.140.27/31, 2a00:a7c0:e20b:104::2/126 ]
```
We take this configuration and pre-generate a suitable VPP config, which exposes two little bugs
in `vppcfg`:
* Rogier had used capital letters in his IPv6 addresses (ie. `2001:2000:3080:0EBC::2`), while the
dataplane reports lower case (ie. `2001:2000:3080:ebc::2`), which consistently yields a diff that
isn't really there. I make a note to fix that.
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all). That one was easy to fix, which I did in [[this
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)].
After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:
```
root@eunetworks-2:~# vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml \
-o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 84 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
root@eunetworks-2:~# vppctl exec /etc/vpp/config/vppcfg.vpp
```
**5. UNDRAIN:** The VPP dataplane comes to life, only to immediately hang. Whoops! What follows is a
90 minute foray into the innards of VPP (and Bird) which I haven't yet fully understood, but will
definitely want to learn more about (future article, anyone?) -- but the TL/DR of our investigation
is this: if an IPv6 address is added to a loopback device, and an OSPFv3 (IPv6) stub area is created
on it, as is common for IPv4 and IPv6 loopback addresses in OSPF, then the dataplane immediately
hangs from the controlplane's point of view, but does continue to forward traffic.
However, we also find a workaround, which is to put the IPv6 loopback address on a physical
interface instead of a loopback interface. Then, we observe a perfectly functioning dataplane, which
has a working BondEthernet with LACP signalling:
```
root@eunetworks-2:~# vppctl show bond details
BondEthernet0
mode: lacp
load balance: l34
number of active members: 2
TenGigabitEthernet1/0/1
TenGigabitEthernet1/0/0
number of members: 2
TenGigabitEthernet1/0/0
TenGigabitEthernet1/0/1
device instance: 0
interface id: 0
sw_if_index: 8
hw_if_index: 8
root@eunetworks-2:~# vppctl show lacp
actor state partner state
interface name sw_if_index bond interface exp/def/dis/col/syn/agg/tim/act exp/def/dis/col/syn/agg/tim/act
TenGigabitEthernet1/0/0 1 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1
LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0001), (8000,02-1c-73-0f-8b-bc,0015,8000,8015)]
RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
TenGigabitEthernet1/0/1 2 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1
LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0002), (8000,02-1c-73-0f-8b-bc,0015,8000,0015)]
RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
```
**6. WRAP UP:** After doing a bit of standard issue ping / ping6 and `show err` and `show log`,
things are looking good. Rogier and I are now ready to slowly introduce the router: we first turn on
OSPF and OSPFv3, and see adjacencies and BFD come up. We make a note that `enp1s0f2` (which is now a
_LIP_ in the dataplane) has OSPF but no BFD. The explanation is that `enp1s0f2` is directly
connected to its peer via a cross connect cable, so if the link fails, OSPF can use link-state to
quickly reconverge. On `bond0`, however, which is connected to a switch, the ethernet link may still
be up even if something along the transport path fails, so BFD is the better choice there. Smart
thinking, Coloclue!
```
root@eunetworks-2:~# birdc6 show ospf nei ospf1
BIRD 1.6.8 ready.
ospf1:
Router ID Pri State DTime Interface Router IP
94.142.247.1 1 Full/PtP 00:33 bond0.130 fe80::669d:99ff:feb1:394b
94.142.247.6 1 Full/PtP 00:31 enp1s0f2 fe80::669d:99ff:feb1:31d8
root@eunetworks-2:~# birdc show bfd ses
BIRD 1.6.8 ready.
bfd1:
IP address Interface State Since Interval Timeout
94.142.247.243 bond0.130 Up 2023-02-24 15:56:29 0.100 0.500
```
We are then ready to undrain iBGP and eBGP to members, transit and peering sessions. Rogier swiftly
takes care of business, and the router finds its spot in the DFZ just a few minutes later:
```
root@eunetworks-2:~# birdc show route count
BIRD 1.6.8 ready.
6239493 of 6239493 routes for 907650 networks
root@eunetworks-2:~# birdc6 show route count
BIRD 1.6.8 ready.
1152345 of 1152345 routes for 169987 networks
root@eunetworks-2:~# vppctl show ip fib sum
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
Prefix length Count
0 1
4 2
8 16
9 13
10 38
11 103
12 299
13 577
14 1214
15 2093
16 13477
17 8250
18 13824
19 24990
20 43089
21 51191
22 109106
23 97073
24 542106
27 3
28 13
29 32
30 36
31 41
32 788
root@eunetworks-2:~# vppctl show ip6 fib sum
ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
Prefix length Count
128 863
127 4
126 4
125 1
120 2
64 22
60 17
52 2
49 2
48 80069
47 3535
46 3411
45 1726
44 14909
43 1041
42 2529
41 932
40 14126
39 1459
38 1654
37 988
36 6640
35 1374
34 3419
33 3707
32 22819
31 294
30 589
29 4373
28 196
27 20
26 15
25 8
24 30
23 7
22 7
21 3
20 15
19 1
10 1
0 1
```
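As a sanity check (a quick sketch using the IPv4 counts above), the per-prefix-length entries should add up to roughly the ~907,650 networks Bird reports -- the FIB carries a few hundred extra connected, local and special routes on top:

```python
# Per-prefix-length route counts from 'show ip fib sum' above
ipv4_counts = {
    0: 1, 4: 2, 8: 16, 9: 13, 10: 38, 11: 103, 12: 299, 13: 577,
    14: 1214, 15: 2093, 16: 13477, 17: 8250, 18: 13824, 19: 24990,
    20: 43089, 21: 51191, 22: 109106, 23: 97073, 24: 542106,
    27: 3, 28: 13, 29: 32, 30: 36, 31: 41, 32: 788,
}

total = sum(ipv4_counts.values())
print(f"IPv4 FIB entries: {total}")  # 908375 -- a few hundred more than Bird's 907650
```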
One thing that I really appreciate is how ... _normal_ ... this machine looks, with no interfaces in
the default namespace. But after switching to the `dataplane` network namespace using `nsenter`, there
they are, and they look (unsurprisingly, because we configured them that way) identical to what was
running before, except now all governed by VPP instead of the Linux kernel:
```
root@eunetworks-2:~# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
root@eunetworks-2:~# nsenter --net=/var/run/netns/dataplane
root@eunetworks-2:~# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
eno2 DOWN ac:1f:6b:e0:b1:0d <BROADCAST,MULTICAST>
enp1s0f2 UP 64:9d:99:b1:31:ad <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
loop0 UP de:ad:00:00:00:00 <BROADCAST,MULTICAST,UP,LOWER_UP>
eno1.205@eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
eno1.992@eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3.501@enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3.511@enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.100@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.101@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.105@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.130@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.2502@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
root@eunetworks-2:~# ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
eno1 UP fe80::ae1f:6bff:fee0:b10c/64
eno2 DOWN
enp1s0f2 UP 94.142.247.246/31 2a02:898:0:300::3/128 2a02:898:0:301::/127 fe80::669d:99ff:feb1:31ad/64
enp1s0f3 UP fe80::669d:99ff:feb1:31ac/64
bond0 UP fe80::669d:99ff:feb1:31af/64
loop0 UP 94.142.247.3/32 fe80::dcad:ff:fe00:0/64
eno1.205@eno1 UP 62.115.144.33/31 2001:2000:3080:ebc::2/126 fe80::ae1f:6bff:fee0:b10c/64
eno1.992@eno1 UP 87.255.32.130/30 2a00:ec8::102/126 fe80::ae1f:6bff:fee0:b10c/64
enp1s0f3.501@enp1s0f3 UP 80.249.211.161/21 2001:7f8:1::a500:8283:1/64 fe80::669d:99ff:feb1:31ac/64
enp1s0f3.511@enp1s0f3 UP 194.62.128.38/24 2001:67c:608::f200:8283:1/64 fe80::669d:99ff:feb1:31ac/64
bond0.100@bond0 UP 94.142.240.253/24 2a02:898:0:20::e2/64 fe80::669d:99ff:feb1:31af/64
bond0.101@bond0 UP 172.28.3.253/24 fe80::669d:99ff:feb1:31af/64
bond0.105@bond0 UP 185.52.225.14/28 2a02:898:0:21::e2/64 fe80::669d:99ff:feb1:31af/64
bond0.130@bond0 UP 94.142.247.242/31 2a02:898:0:301::14/127 fe80::669d:99ff:feb1:31af/64
bond0.2502@bond0 UP 37.139.140.27/31 2a00:a7c0:e20b:104::2/126 fe80::669d:99ff:feb1:31af/64
```
{{< image width="400px" float="right" src="/assets/coloclue-vpp/traffic.jpg" alt="Traffic" >}}
Of course, VPP handles all the traffic _through_ the machine, and the only traffic that Linux will
see is that which is destined to the controlplane (eg. to one of the IPv4 or IPv6 addresses, or
multicast/broadcast groups that it participates in), so things like tcpdump or SNMP won't
really work.
However, thanks to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which
feeds as an AgentX subagent into an snmpd that in turn runs in the `dataplane` namespace, SNMP
scrapes work as they did before, albeit with a few different interface names.
**7. KEEPALIVED:** Earlier, I had failed over `keepalived` and stopped the service. This way, the
peer router `eunetworks-3` would pick up all outbound traffic to the virtual IPv4 and IPv6 addresses
serving as our users' default gateway. Because we're mainly interested in non-intrusively measuring
the BGP beacon (which is forced to always go through this machine), and we know some of our members
use BGP and prefer this router because it's connected to AMS-IX, we make a decision to leave
keepalived turned off for now.
But traffic is flowing, and in fact with a little more throughput than before -- possibly because
traffic flows faster when there isn't 5% packet loss on certain egress paths? I don't know, but OK,
moving along!
## Results
Clearly VPP is a winner in this scenario. If you recall the traceroute from before the operation, the latency
was good up until `nlams0.ipng.ch`, after which loss occurred and variance was very high. Rogier and
I let the VPP instance run overnight, and started this traceroute after our maintenance was
concluded:
```
My traceroute [v0.94]
squanchy.ipng.ch (194.1.163.90) -> 185.52.227.1 2023-02-25T09:48:46+0100
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. chbtl0.ipng.ch 0.0% 51796 0.6 0.2 0.1 1.7 0.2
2. chrma0.ipng.ch 0.0% 51796 1.6 1.0 0.9 5.5 1.2
3. defra0.ipng.ch 0.0% 51796 7.0 6.5 6.4 27.7 1.9
4. nlams0.ipng.ch 0.0% 51796 12.7 12.6 12.5 43.8 3.9
5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.0% 51796 13.3 13.0 12.8 138.9 11.1
6. 185.52.227.1 0.0% 51796 13.6 12.7 12.3 46.6 8.3
```
{{< image width="400px" float="right" src="/assets/coloclue-vpp/bill-clinton-zero.gif" alt="Clinton Zero" >}}
This mtr shows clear network weather with absolutely no packets dropped from Br&uuml;ttisellen (near
Zurich, Switzerland) all the way to the BGP beacon running in EUNetworks in Amsterdam. Considering
I've been running VPP for a few years now, including writing the code necessary to plumb the
dataplane interfaces through to Linux so that a higher order control plane (such as Bird, or FRR)
can manipulate them, I am reasonably bullish, but I do hope to convert others.
This computer now forwards packets like a boss, its packet loss is &rarr;
Looking at the local situation, from a hypervisor running at IPng Networks in Equinix AM3 via FrysIX
through VPP and into the dataplane of the Coloclue router `eunetworks-2` , shows quite reasonable
throughput as well:
```
root@eunetworks-2:~# traceroute hvn0.nlams3.ipng.ch
traceroute to 46.20.243.179 (46.20.243.179), 30 hops max, 60 byte packets
1 enp1s0f3.eunetworks-3.router.nl.coloclue.net (94.142.247.247) 0.087 ms 0.078 ms 0.071 ms
2 frys-ix.ip-max.net (185.1.203.135) 1.288 ms 1.432 ms 1.479 ms
3 hvn0.nlams3.ipng.ch (46.20.243.179) 0.524 ms 0.534 ms 0.531 ms
root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10
Connecting to host 46.20.243.179, port 5201
...
[SUM] 0.00-10.00 sec 6.70 GBytes 5.76 Gbits/sec 192 sender
[SUM] 0.00-10.03 sec 6.58 GBytes 5.64 Gbits/sec receiver
root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 -R
Connecting to host 46.20.243.179, port 5201
Reverse mode, remote host 46.20.243.179 is sending
...
[SUM] 0.00-10.03 sec 6.07 GBytes 5.20 Gbits/sec 54623 sender
[SUM] 0.00-10.00 sec 6.03 GBytes 5.18 Gbits/sec receiver
```
And the smokepings look just plain gorgeous:
--------------- | ---------------------
![After nlams01](/assets/coloclue-vpp/coloclue-beacon-after-nlams01.png) | ![After chbtl01](/assets/coloclue-vpp/coloclue-beacon-after-chbtl01.png)
_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
Netherlands), and from (right) a machine at AS8298 IPng (in Br&uuml;ttisellen, Switzerland), both
are showing no packetloss and clearly improved performance in end to end latency. Super!_
## What's next
The performance of the one router we upgraded definitely improved, no question about that. But
there's a couple of things that I think we still need to do, so Rogier and I rolled back the change
to the previous situation and kernel based routing.
* We didn't migrate keepalived, although IPng runs this in our DDLN [[colocation]({% post_url 2022-02-24-colo %})]
site, so I'm pretty confident that it will work.
* Kees and Ansible at Coloclue will need a few careful changes to facilitate ongoing automation:
  think of dataplane and controlplane firewalls, sysctls (uRPF et al), fastnetmon, and so on; these
  will need a meaningful overhaul.
* There's an unknown dataplane hang when Bird's IPv6 OSPF enables a `stub` interface on
`lo0`. We worked around that by putting the loopback IPv6 address on another interface, but this
needs to be fully understood.
* Completely unrelated to Coloclue, there's one dataplane hang regarding IPv6 RA/NS and/or BFD
and/or Linux Control Plane that the VPP developer community is hunting down - it happens with
my plugin but also with [[TNSR](http://www.netgate.com/tnsr)] (which uses the upstream `linux-cp` plugin).
I've been working with a few folks from Netgate and customers of IPng Networks to try to find the root
cause, as AS8298 has been bitten by this a few times over the last ~quarter or so. I cannot
recommend in good faith running VPP until this is sorted out.
As an important side note, VPP is not well enough understood at Coloclue - rolling this out further
risks making me a single point of failure in the networking committee, and I'm not comfortable taking
that responsibility. I recommend that Coloclue network committee members gain experience with VPP,
DPDK, `vppcfg` and the other ecosystem tools, and that at least the bird6 OSPF issue and possible
IPv6 NS/RA issue are understood, before making the jump to the VPP world.
---
date: "2023-03-11T11:56:54Z"
title: 'Case Study: Centec MPLS Core'
---
After receiving an e-mail from [[Starry Networks](https://starry-networks.com)], I had a chat with their
founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.
I got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at line rate on
56 ports, on a really affordable budget ($4'200,- for the 56 port switch; $1'650,- for the 26 port
switch). This reseller uses a lesser-known silicon vendor called
[[Centec](https://www.centec.com/silicon)], who have a lineup of Ethernet chipsets. In this device,
the CTC8096 (GoldenGate) is used for cost-effective, high-density 10GbE/40GbE applications, paired
with 4x100GbE uplink capability. The CTC8096 is Centec's fourth generation, inheriting from its
predecessors a feature set that ranges from L2/L3 switching to advanced datacenter and metro
Ethernet capabilities. The switch chip provides up to 96x10GbE, 24x40GbE, or 80x10GbE + 4x100GbE
ports, and supports L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlight features include telemetry,
programmability, security, traffic management, and network time synchronization.
{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}
{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}
<br /><br />
After discussing basic L2, L3 and overlay functionality in my [[first post]({% post_url
2022-12-05-oem-switch-1 %})], and exploring the functionality and performance of MPLS and VPLS in my
[[second post]({% post_url 2022-12-09-oem-switch-2 %})], I convinced myself and committed to a bunch
of these for IPng Networks. I'm now ready to roll out these switches and create a BGP-free core
network for IPng Networks. If this kind of thing tickles your fancy, by all means read on :)
## Overview
You may be wondering what folks mean when they talk about a [[BGP Free
Core](https://bgphelp.com/2017/02/12/bgp-free-core/)], and you may also ask yourself why I would
decide to retrofit this into our network. For most operators, running this way leaves very little
room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it's relatively
simple in design and implementation. Some advantages worth mentioning:
* Transport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either
in the RIB or FIB, allowing them to be much cheaper.
* As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU
utilization during massive BGP re-convergence.
* Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public
internet exchange, to take two common examples) can be eliminated. If a new BGP security
vulnerability were to be discovered, transport devices aren't impacted.
* Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and
the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated.
* New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and
eVPN can all be introduced without modifying the routing core.
If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet,
making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities
for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers
between locations, rackspace, power, cross connects).
For smaller clubs (like IPng Networks), being able to share a 100G wave with others significantly reduces
the price per Megabit! So if you're in Zurich, Switzerland, or elsewhere in Europe, and find this an
interesting avenue to expand your reach in a co-op style environment, [[reach out](/s/contact)] to us, any time!
### Hybrid Design
I've decided to make this the direction of IPng's core network -- I know that the specs of the
Centec switches I've bought will allow for a modest but not huge amount of routes in the hardware
forwarding tables. I loadtested them in [[a previous article]({% post_url 2022-12-05-oem-switch-1
%})] at line rate (well, at least 8x10G at 64b packets and around 110Mpps), so they were forwarding
both IPv4 and MPLS traffic effortlessly, and at 45 Watts I might add! However, they clearly cannot
operate in the DFZ for two main reasons:
1. The FIB is limited to 12K IPv4 and 2K IPv6 entries, so they can't hold a full table
1. The CPU is a bit wimpy, so it won't be happy doing large BGP reconvergence operations
IPng Networks has three (3) /24 IPv4 networks, which means we're not swimming in IPv4 addresses.
But, I'm possibly the world's okayest systems engineer, and I happen to know that most things don't
really need an IPv4 address anymore. There's all sorts of fancy loadbalancers like
[[NGINX](https://nginx.org)] and [[HAProxy](https://www.haproxy.org/)] which can take traffic (UDP,
TCP or higher level constructs like HTTP traffic), provide SSL offloading, and then talk to one or
more loadbalanced backends to retrieve the actual content.
#### IPv4 versus IPv6
{{< image float="right" src="/assets/mpls-core/go-for-it-90s.gif" alt="The 90s" >}}
Most modern operating systems can operate in IPv6-only mode; certainly the Debian, Ubuntu and
Apple machines that are common in my network are happily dual-stack, and probably mono-stack as well.
I've been running IPv6 since, eh, the 90s (my first allocation was on the 6bone in
1996, and I ran [[SixXS](https://sixxs.net/)] for longer than I can remember!).
You might be inclined to argue that I should be able to advance the core of my serverpark to
IPv6-only ... but unfortunately that's not only up to me, as it has been mentioned to me a number of times
that my [[Videos](https://video.ipng.ch/)] are not reachable. Of course they are, but only if
your computer speaks IPv6.
In addition to my stuff needing _legacy_ reachability, some external websites, including pretty big
ones (I'm looking at you, [[GitHub](https://github.com/)] and [[Cisco
T-Rex](https://trex-tgn.cisco.com/)]) are still IPv4 only, and some network gear still hasn't
really caught on to the IPv6 control- and management plane scene (for example, SNMP traps or
scraping, BFD, LDP, and a few others, even in a modern switch like the Centecs that I'm about to
deploy).
#### AS8298 BGP-Free Core
I have a few options -- I could be stubborn and do NAT64 for an IPv6-only internal network. But if I'm going
to be doing NAT anyway, I decide to make a compromise and deploy my new network using private IPv4
space alongside public IPv6 space, and to deploy a few strategically placed border gateways that can
do the translation and frontending for me.
There's quite a few private/reserved IPv4 ranges on the internet, which the current LIRs on the
RIPE [[Waiting List](https://www.ripe.net/manage-ips-and-asns/ipv4/ipv4-waiting-list)] are salivating
all over, gross. However, there's a few ones beyond canonical [[RFC1918](https://www.rfc-editor.org/rfc/rfc1918)]
that are quite frequently used in enterprise networking, for example by large Cloud providers. They build
what is called a _Virtual Private Cloud_ or
[[VPC](https://www.cloudflare.com/learning/cloud/what-is-a-virtual-private-cloud/)]. And if they can
do it, so can I!
#### Numberplan
Let me draw your attention to [[RFC5735](https://www.rfc-editor.org/rfc/rfc5735)], which describes
special use IPv4 addresses. One of these is **198.18.0.0/15**: this block has been allocated for use
in benchmark tests of network interconnect devices. What I found interesting, is that
[[RFC2544](https://www.rfc-editor.org/rfc/rfc2544)] explains that this range was assigned to minimize the
chance of conflict in case a testing device were to be accidentally connected to part of the Internet.
Packets with source addresses from this range are not meant to be forwarded across the Internet.
But, they can _totally_ be used to build a pan-European private network that is not directly connected
to the internet. I grab my free private Class-B, like so:
* For IPv4, I take the second /16 from that to use as my IPv4 block: **198.19.0.0/16**.
* For IPv6, I carve out a small part of IPng's own IPv6 PI block: **2001:678:d78:500::/56**
First order of business is to create a simple numberplan that's not totally broken:
Purpose | IPv4 Prefix | IPv6 Prefix
--------------|-------------------|------------------
Loopbacks | 198.19.0.0/24 (size /32) | 2001:678:d78:500::/64 (size /128)
P2P Networks | 198.19.2.0/23 (size /31) | 2001:678:d78:501::/64 (size /112)
Site Local Networks | 198.19.4.0/22 (size /27) | 2001:678:d78:502::/56 (size /64)
This simple start leaves most of the IPv4 space allocatable for the future, while giving me lots of
IPv4 and IPv6 addresses to retrofit this network in all sites where IPng is present, which is
[[quite a few](https://as8298.peeringdb.com/)]. All of **198.19.1.0/24** (reserved either for P2P
networks or for loopbacks, whichever I'll need first), **198.19.8.0/21**, **198.19.16.0/20**,
**198.19.32.0/19**, **198.19.64.0/18** and **198.19.128.0/17** will be ready for me to use in the
future, and they are all nicely tucked away under one **19.198.in-addr.arpa** reverse DNS domain, which
I stub out on IPng's resolvers. Winner!
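The carve-outs in this numberplan can be double-checked with Python's standard `ipaddress` module; this is just a verification sketch of the table above:

```python
import ipaddress

# 198.18.0.0/15 is the RFC 2544 benchmarking block; IPng uses its second /16
bench = ipaddress.ip_network("198.18.0.0/15")
aggregate = list(bench.subnets(new_prefix=16))[1]
print(aggregate)  # 198.19.0.0/16

loopbacks  = ipaddress.ip_network("198.19.0.0/24")  # one /32 per router
p2p        = ipaddress.ip_network("198.19.2.0/23")  # one /31 per link
site_local = ipaddress.ip_network("198.19.4.0/22")  # one /27 per site

# All three carve-outs fit inside the /16 and do not overlap each other
plan = [loopbacks, p2p, site_local]
assert all(net.subnet_of(aggregate) for net in plan)
assert not any(a.overlaps(b) for i, a in enumerate(plan) for b in plan[i + 1:])

print(len(list(p2p.subnets(new_prefix=31))))  # 256 point-to-point links
```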
### Inserting MPLS Under AS8298
I am currently running [[VPP](https://fd.io)] based on my own deployment [[article]({% post_url
2021-09-21-vpp-7%})], and this has a bunch of routers connected back-to-back with one another using
either crossconnects (if there are multiple routers in the same location), or a CWDM/DWDM wave over
dark fiber (if they are in adjacent buildings and I have found a provider willing to share their
dark fiber with me), or a Carrier Ethernet virtual leased line (L2VPN, provided by folks like
[[Init7](https://init7.net)] in Switzerland, or [[IP-Max](https://ip-max.net)] throughout Europe in
our [[backbone]({% post_url 2021-02-27-network %})]).
{{< image width="350px" float="right" src="/assets/mpls-core/before.svg" alt="Before" >}}
Most of these links are actually "just" point to point ethernet links, which I can use untagged (eg
`xe1-0`), or add any _dot1q_ sub-interfaces (eg `xe1-0.10`). In some cases, the ISP will deliver the
circuit to me with an additional _outer_ tag, in which case I can still use that interface (eg
`xe1-0.400`) and create _qinq_ sub-interfaces (eg `xe1-0.400.10`).
In January 2023, my Zurich metro deployment looks a bit like the top drawing to the right. Of
course, these routers connect to all sorts of other things, like internet exchange points
([[SwissIX](https://swissix.net/)], [[CHIX](https://ch-ix.ch/)],
[[CommunityIX](https://communityrack.org/)], and [[FreeIX](https://free-ix.net/)]), IP transit
upstreams (in Zurich mainly [[IP-Max](https://as25091.peeringdb.com/)] and
[[Openfactory](https://as58299.peeringdb.com/)]), and customer downstreams, colocation space,
private network interconnects with others, and so on.
I want to draw your attention to the four _main_ links between these routers:
1. Orange (bottom): chbtl0 and chbtl1 are at our offices in Br&uuml;ttisellen; they're in two
separate racks, and have 24 fibers between them. Here, the two routers connect back to back with a
25G SFP28 optic at 1310nm.
1. Blue (top): Between chrma0 (at NTT datacenter in R&uuml;mlang) and chgtg0 (at Interxion datacenter
in Glattbrugg), IPng rents a CWDM wave from Openfactory, so the two routers here connect back to
back also, albeit over 4.2km of dark fiber between the two datacenters, with a 25G SFP28 optic at 1610nm.
1. Red (left): Between chbtl0 and chrma0, Init7 provides a 10G L2VPN over MPLS ethernet circuit,
starting in our offices with a BiDi 10G optic, and delivered at NTT on a BiDi 10G optic as well (we
did this, so that the cross connect between our racks might in the future be able to use the other
fiber). Init7 delivers both ports tagged VLAN 400.
1. Green (right): Between chbtl1 and chgtg0, Openfactory provides a 10G VLAN ethernet circuit,
starting in our offices with a BiDi 10G optic to the local telco, and then transported over dark
fiber by UPC to Interxion. Openfactory delivers both sides tagged VLAN 200-209 to us.
This is a super fun puzzle! I am running a live network, with customers, and I want to retrofit this
MPLS network _underneath_ my existing network, and after thinking about it for a while, I see how I
can do it.
{{< image width="350px" float="right" src="/assets/mpls-core/after.svg" alt="After" >}}
To avoid using the link, I raise OSPF cost for the link chbtl0-chrma0, the red link in the graph.
Traffic will now flow via chgtg0 and through chbtl1. After I've taken the link out of service, I
make a few important changes:
1. First, I move the interface on both VPP routers from its _dot1q_ tagged `xe1-0.400` to a double-tagged
`xe1-0.400.10`. Init7 will pass this through for me, and after I make the change, I can ping
both sides again (with a subtle loss of 4 bytes of payload because of the second tag).
1. Next, I unplug the Init7 link on both sides and plug them into a TenGig port on a Centec switch
that I deployed in both sites, and I take a second TenGig port and I plug that into the router. I
make both ports a _trunk_ mode switchport, and allow VLAN 400 tagged on it.
1. Finally, on the switch I create interface `vlan400` on both sides, and the two switches can see
each other directly connected now on the single-tagged interface, while the two routers can see each
other directly connected now on the double-tagged interface.
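The 4-byte payload loss per 802.1Q tag mentioned in step 1 is simple arithmetic, sketched here with a hypothetical helper (not part of any tooling):

```python
DOT1Q_TAG = 4  # bytes each 802.1Q header adds to the Ethernet frame

def usable_l3_mtu(link_mtu: int, tags: int) -> int:
    """L3 MTU left over when `tags` VLAN tags ride inside a fixed link MTU."""
    return link_mtu - tags * DOT1Q_TAG

# Moving from single-tagged (dot1q) to double-tagged (qinq) costs 4 bytes:
print(usable_l3_mtu(1504, 1))  # 1500
print(usable_l3_mtu(1504, 2))  # 1496
```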
With the _red_ leg taken care of, I ask the kind folks from Openfactory if they'd mind me using
a second wavelength for the duration of my migration, which they kindly agree to. So, I plug in a new
CWDM 25G optic on another channel (1270nm), and bring the network to Glattbrugg, where I deploy a
Centec switch.
With the _blue_/_purple_ leg taken care of, all I have to do is undrain the _red_ link (lower OSPF
cost) while draining the _green_ link (raising its OSPF cost). Traffic now flips back from chgtg0
through chrma0 and into chbtl0. I can rinse and repeat the green leg, moving the interfaces on the
routers to a double-tagged `xe1-0.200.10` on both sides, inserting and moving the _green_ link from
the routers into the switches, and connecting them in turn to the routers.
## Configuration
And just like that, I've inserted a triangle of Centec switches without disrupting any traffic,
would you look at that! They are however, still "just" switches, each with two ports sharing
the _red_ VLAN 400 and the _green_ VLAN 200, and doing ... decidedly nothing on the _purple_ leg,
as those ports aren't even switchports!
Next up: configuring these switches to become, you guessed it, routers!
### Interfaces
I will take the switch at NTT R&uuml;mlang as an example, but the switches really are all very
similar. First, I define the loopback addresses and transit networks to chbtl0 (_red_ link) and
chgtg0 (_purple_ link).
```
interface loopback0
description Core: msw0.chrma0.net.ipng.ch
ip address 198.19.0.2/32
ipv6 address 2001:678:d78:500::2/128
!
interface vlan400
description Core: msw0.chbtl0.net.ipng.ch (Init7)
mtu 9172
ip address 198.19.2.1/31
ipv6 address 2001:678:d78:501::2/112
!
interface eth-0-38
description Core: msw0.chgtg0.net.ipng.ch (UPC 1270nm)
mtu 9216
ip address 198.19.2.4/31
ipv6 address 2001:678:d78:501::2:1/112
```
I need to make sure that the MTU is correct on both sides (this will be important later when OSPF is
turned on), and I ensure that the underlay has sufficient MTU (notably for the Init7 L2VPN; the _purple_
interface goes over dark fiber with no active equipment in between!). I issue a set of ping commands,
ensuring that the don't-fragment bit is set and that the size of the resulting IP packet is exactly
what my MTU claims I should allow, and validate that indeed, we're good.
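To size such a don't-fragment ping, subtract the IP and ICMP header sizes from the target MTU. The helper below is a sketch of that arithmetic (the function name is mine, not from any runbook), assuming no IP options or extension headers:

```python
def ping_payload(mtu: int, ipv6: bool = False) -> int:
    """Payload size for `ping -M do -s SIZE` so the IP packet is exactly `mtu` bytes."""
    ip_header = 40 if ipv6 else 20  # assumes no options / extension headers
    icmp_header = 8
    return mtu - ip_header - icmp_header

# For the 9172-byte interface MTU on vlan400:
print(ping_payload(9172))             # 9144 -> ping -M do -s 9144 <peer>
print(ping_payload(9172, ipv6=True))  # 9124 -> ping -6 -M do -s 9124 <peer>
```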
### OSPF, LDP, MPLS
For OSPF, I am certain that this network should never carry or propagate anything other than the
**198.19.0.0/16** and **2001:678:d78:500::/56** networks that I have assigned to it, even if it were
to be connected to other things (like an out-of-band connection, or even AS8298), so as belt-and-braces
style protection I take the following base-line configuration:
```
ip prefix-list pl-ospf seq 5 permit 198.19.0.0/16 le 32
ipv6 prefix-list pl-ospf seq 5 permit 2001:678:d78:500::/56 le 128
!
route-map ospf-export permit 10
match ipv6 address prefix-list pl-ospf
route-map ospf-export permit 20
match ip address prefix-list pl-ospf
route-map ospf-export deny 9999
!
router ospf
router-id 198.19.0.2
redistribute connected route-map ospf-export
redistribute static route-map ospf-export
network 198.19.0.0/16 area 0
!
router ipv6 ospf
router-id 198.19.0.2
redistribute connected route-map ospf-export
redistribute static route-map ospf-export
!
ip route 198.19.0.0/16 null0
ipv6 route 2001:678:d78:500::/56 null0
```
I also set a static discard by means of a _nullroute_ for the space belonging to the private
network. This way, packets will not loop around if there is no more specific route for them in OSPF.
The route-map ensures that I'll only be advertising _our space_, even if the switches eventually get
connected to other networks, for example some out-of-band access mechanism.
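The `permit 198.19.0.0/16 le 32` semantics (match any prefix covered by the /16, with a length of up to /32) can be sketched in Python with the standard `ipaddress` module; the function name is mine, for illustration:

```python
import ipaddress

AGGREGATE = ipaddress.ip_network("198.19.0.0/16")

def matches_pl_ospf(prefix: str) -> bool:
    """Mimic `ip prefix-list pl-ospf permit 198.19.0.0/16 le 32`."""
    net = ipaddress.ip_network(prefix)
    # only IPv4 prefixes fully contained in the aggregate pass the filter
    return net.version == 4 and net.subnet_of(AGGREGATE)

print(matches_pl_ospf("198.19.4.0/27"))   # True: a site-local network
print(matches_pl_ospf("194.1.163.0/24"))  # False: public space stays out of OSPF
```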
Next up, enabling LDP and MPLS, which is very straightforward. On the interfaces, I'll add the
***label-switching*** and ***enable-ldp*** keywords, and ensure that the OSPF and OSPFv3
speakers on these interfaces know that they are in __point-to-point__ mode. For the cost, I will
start off with the latency expressed in tenths of milliseconds; in other words, if the latency between
chbtl0 and chrma0 is 0.8ms, I will set the cost to 8:
```
interface vlan400
description Core: msw0.chbtl0.net.ipng.ch (Init7)
mtu 9172
label-switching
ip address 198.19.2.1/31
ipv6 address 2001:678:d78:501::2/112
ip ospf network point-to-point
ip ospf cost 8
ip ospf bfd
ipv6 ospf network point-to-point
ipv6 ospf cost 8
ipv6 router ospf area 0
enable-ldp
!
router ldp
router-id 198.19.0.2
transport-address 198.19.0.2
!
```
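The latency-to-cost convention above (cost = latency in tenths of a millisecond) can be sketched as a tiny helper; the function name and the floor of 1 are my own conventions, not part of any config tooling:

```python
def ospf_cost(latency_ms: float) -> int:
    """OSPF cost as latency in tenths of milliseconds, with a floor of 1."""
    return max(1, round(latency_ms * 10))

print(ospf_cost(0.8))   # 8: chbtl0 - chrma0 over the Init7 L2VPN
print(ospf_cost(0.05))  # 1: back-to-back links never drop below cost 1
```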
The rest is really just rinse-and-repeat. I loop around all relevant interfaces, and see all of
OSPF, OSPFv3, and LDP adjacencies form:
```
msw0.chrma0# show ip ospf nei
OSPF process 0:
Neighbor ID Pri State Dead Time Address Interface
198.19.0.0 1 Full/ - 00:00:35 198.19.2.0 vlan400
198.19.0.3 1 Full/ - 00:00:39 198.19.2.5 eth-0-38
msw0.chrma0# show ipv6 ospf nei
OSPFv3 Process (0)
Neighbor ID Pri State Dead Time Interface Instance ID
198.19.0.0 1 Full/ - 00:00:37 vlan400 0
198.19.0.3 1 Full/ - 00:00:39 eth-0-38 0
msw0.chrma0# show ldp session
Peer IP Address IF Name My Role State KeepAlive
198.19.0.0 vlan400 Active OPERATIONAL 30
198.19.0.3 eth-0-38 Active OPERATIONAL 30
```
### Connectivity
{{< image float="right" src="/assets/mpls-core/backbone.svg" alt="Backbone" >}}
And after I'm done with this heavy lifting, I can now build MPLS services (like L2VPN and VPLS) on
these three switches. But as you may remember, IPng is in a few more sites than just
Br&uuml;ttisellen, R&uuml;mlang and Glattbrugg. While a lot of work, retrofitting every
site in exactly the same way is not mentally challenging, so I'm not going to spend a lot of words
describing it. Wax on, wax off.
Once I'm done though, the (MPLS) network looks a little bit like this. What's really cool about it
is that it's a fully capable IPv4 and IPv6 network running OSPF, OSPFv3, LDP and MPLS services,
albeit one that's not connected to the internet, yet. This means that I've successfully created
a completely private network that spans all sites we have active equipment in, without standing
in the way of our public facing (VPP) routers in AS8298. Customers haven't noticed a single
thing, except now they can benefit from any L2 services (using MPLS tunnels or VPLS clouds) from any
of our sites. Neat!
Our VPP routers are connected through the switches, (carrier) L2VPN and WDM waves just as they were
before, but carried transparently by the Centec switches. Performance-wise, there is no regression,
because the switches do line-rate L2/MPLS switching and L3 forwarding. This means that the VPP
routers, except for a little detour in and out of the switch for their long haul, have the same
throughput as they had before.
I will deploy three additional features, to make this new private network a fair bit more powerful:
**1. Site Local Connectivity**
Each switch gets what is called an IPng Site Local (or _ipng-sl_) interface. This is a /27 IPv4 and
a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links
_between_ sites are no longer switched, they are _routed_ and pass ethernet frames only using MPLS.
I can connect for example all of the fleet's hypervisors to this internal network. I have given our
three bastion _jumphosts_ (Squanchy, Glootie and Pencilvester) an address on this internal network
as well, just look at this beautiful result:
```
pim@hvn0-ddln0:~$ traceroute hvn0.nlams3.net.ipng.ch
traceroute to hvn0.nlams3.net.ipng.ch (198.19.4.98), 64 hops max, 40 byte packets
1 msw0.ddln0.net.ipng.ch (198.19.4.129) 1.488 ms 1.233 ms 1.102 ms
2 msw0.chrma0.net.ipng.ch (198.19.2.1) 2.138 ms 2.04 ms 1.949 ms
3 msw0.defra0.net.ipng.ch (198.19.2.13) 6.207 ms 6.288 ms 7.862 ms
4 msw0.nlams0.net.ipng.ch (198.19.2.14) 13.424 ms 13.459 ms 13.513 ms
5 hvn0.nlams3.net.ipng.ch (198.19.4.98) 12.221 ms 12.131 ms 12.161 ms
pim@hvn0-ddln0:~$ iperf3 -6 -c hvn0.nlams3.net.ipng.ch -P 10
Connecting to host hvn0.nlams3, port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes
[ 7] 9.00-10.00 sec 71.2 MBytes 598 Mbits/sec 0 1.73 MBytes
[ 9] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.30 MBytes
[ 11] 9.00-10.00 sec 91.2 MBytes 765 Mbits/sec 0 2.16 MBytes
[ 13] 9.00-10.00 sec 88.8 MBytes 744 Mbits/sec 0 2.13 MBytes
[ 15] 9.00-10.00 sec 62.5 MBytes 524 Mbits/sec 0 1.57 MBytes
[ 17] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes
[ 19] 9.00-10.00 sec 65.0 MBytes 561 Mbits/sec 0 1.39 MBytes
[ 21] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.24 MBytes
[ 23] 9.00-10.00 sec 63.8 MBytes 535 Mbits/sec 0 1.58 MBytes
[SUM] 9.00-10.00 sec 685 MBytes 5.79 Gbits/sec 0
...
[SUM] 0.00-10.00 sec 7.38 GBytes 6.34 Gbits/sec 177 sender
[SUM] 0.00-10.02 sec 7.37 GBytes 6.32 Gbits/sec receiver
```
**2. Egress Connectivity**
Having a private network is great, as it allows me to run the entire internal environment with 9000
byte jumboframes, mix IPv4 and IPv6, segment off background tasks such as ZFS replication and
borgbackup between physical sites, and employ monitoring with Prometheus and LibreNMS and log in
safely with SSH or IPMI without ever needing to leave the safety of the walled garden that is
**198.19.0.0/16**.
Hypervisors will now typically get a management interface _only_ in this network, and for them to be
able to do things like run _apt upgrade_, some remote repositories will need to be reachable over
IPv4 as well. For this, I decide to add three internet gateways, which will have one leg into the
private network, and one leg out into the world. For IPv4 they'll provide NAT, and for IPv6 they'll
ensure only _trusted_ traffic can enter the private network.
These gateways will:
* Connect to the internal network with OSPF and OSPFv3:
* They will learn 198.19.0.0/16, 2001:678:d78:500::/56 and their more specifics from it
* They will inject a default route for 0.0.0.0/0 and ::/0 to it
* Connect to AS8298 with BGP:
* They will receive a default IPv4 and IPv6 route from AS8298
* They will announce the two aggregate prefixes to it with **no-export** community set
* Provide a WireGuard endpoint to allow remote management:
* Clients will be put in 192.168.6.0/24 and 2001:678:d78:300::/56
* These ranges will be announced both to AS8298 externally and to OSPF internally
This provides dynamic routing at its best. If the gateway, the physical connection to the internal
network, or the OSPF adjacency is down, AS8298 will not learn the routes into the internal network
at this node. If the gateway, the physical connection to the external network, or the BGP adjacency
is down, the Centec switch will not pick up the default routes, and no traffic will be sent through
it. By having three such nodes geographically separated (one in Br&uuml;ttisellen, one in
Plan-les-Ouates and one in Amsterdam), I am very likely to have stable and resilient connectivity.
At the same time, these three machines serve as WireGuard endpoints to be able to remotely manage
the network. For this purpose, I've carved out **192.168.6.0/26** and **2001:678:d78:300::/56** and
will hand out IP addresses from those to clients. I'd like these two networks to have access to the
internal private network as well.
The Bird2 OSPF configuration for one of the nodes (in Br&uuml;ttisellen) looks like this:
```
filter ospf_export {
if (net.type = NET_IP4 && net ~ [ 0.0.0.0/0, 192.168.6.0/26 ]) then accept;
if (net.type = NET_IP6 && net ~ [ ::/0, 2001:678:d78:300::/64 ]) then accept;
if (source = RTS_DEVICE) then accept;
reject;
}
filter ospf_import {
if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept;
if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept;
reject;
}
protocol ospf v2 ospf4 {
debug { events };
ipv4 { export filter ospf_export; import filter ospf_import; };
area 0 {
interface "lo" { stub yes; };
interface "wg0" { stub yes; };
interface "ipng-sl" { type broadcast; cost 15; bfd on; };
};
}
protocol ospf v3 ospf6 {
debug { events };
ipv6 { export filter ospf_export; import filter ospf_import; };
area 0 {
interface "lo" { stub yes; };
interface "wg0" { stub yes; };
interface "ipng-sl" { type broadcast; cost 15; bfd off; };
};
}
```
The ***ospf_export*** filter is what we announce to the Centec switches: precisely the default
route and the WireGuard space, in addition to connected routes. The ***ospf_import*** filter
is what we're willing to learn from the Centec switches; here we accept exactly the
aggregate **198.19.0.0/16** and **2001:678:d78:500::/56** prefixes belonging to the private internal
network.
The Bird2 BGP configuration for this gateway then looks like this:
```
filter bgp_export {
if (net.type = NET_IP4 && ! (net ~ [ 198.19.0.0/16, 192.168.6.0/26 ])) then reject;
if (net.type = NET_IP6 && ! (net ~ [ 2001:678:d78:500::/56, 2001:678:d78:300::/64 ])) then reject;
# Add BGP Wellknown community no-export (FFFF:FF01)
bgp_community.add((65535,65281));
accept;
}
template bgp T_GW4 {
local as 64512;
source address 194.1.163.72;
default bgp_med 0;
default bgp_local_pref 400;
ipv4 { import all; export filter bgp_export; next hop self on; };
}
template bgp T_GW6 {
local as 64512;
source address 2001:678:d78:3::72;
default bgp_med 0;
default bgp_local_pref 400;
ipv6 { import all; export filter bgp_export; next hop self on; };
}
protocol bgp chbtl0_ipv4_1 from T_GW4 { neighbor 194.1.163.66 as 8298; };
protocol bgp chbtl1_ipv4_1 from T_GW4 { neighbor 194.1.163.67 as 8298; };
protocol bgp chbtl0_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::2 as 8298; };
protocol bgp chbtl1_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::3 as 8298; };
```
The ***bgp_export*** filter is where we restrict our announcements to only exactly the prefixes
we've learned from the Centec, and WireGuard. We'll set the _no-export_ BGP community on it, which
will allow the prefixes to live in AS8298 but never be announced to any eBGP peers. If any of
the machine, the BGP session, the WireGuard interface, or the default route were missing, the
prefixes would simply not be announced. In the other direction, if the Centec is not feeding the gateway its
prefixes via OSPF, the BGP session may be up, but it will not be propagating these prefixes, and the
gateway will not attract network traffic to it. There are two BGP uplinks to AS8298 here, which also
provides resilience in case one of them is down for maintenance or in a fault condition. __N+k__ is a
great rule to live by when it comes to network engineering.
The last two things I should provide on each gateway are **(A)** a NAT translator from internal to
external, and **(B)** a firewall that ensures only authorized traffic gets passed to the Centec
network.
First, I'll provide an IPv4 NAT translation to the internet facing AS8298 (`ipng`), for traffic
that is coming from WireGuard or the private network, while allowing it to pass _between_ the two
networks without performing NAT. The first rule says to jump to __ACCEPT__ (skipping the NAT rules)
if the source is WireGuard and the destination is the private network. The next two rules say to
provide NAT towards the internet for any traffic coming from WireGuard or the private network. The
fourth and last rule says to provide NAT
towards the _internal_ private network, so that anything trying to get into the network will be
coming from an address in **198.19.0.0/16** as well. Here they are:
```
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng-sl -j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng -j MASQUERADE
iptables -t nat -A POSTROUTING -s 198.19.0.0/16 -o ipng -j MASQUERADE
iptables -t nat -A POSTROUTING -o ipng-sl -j MASQUERADE
```
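Because iptables evaluates a chain top-down and stops at the first matching target, the ordering of
these rules matters. Here's a toy first-match evaluation in Python, mirroring the four rules above
(purely illustrative — this is not how netfilter is implemented):

```python
import ipaddress

# The four POSTROUTING rules above, as (source prefix, output interface, target)
# tuples; None means "any source", mirroring the final catch-all rule.
RULES = [
    ("192.168.6.0/24", "ipng-sl", "ACCEPT"),      # wg -> private: skip NAT
    ("192.168.6.0/24", "ipng",    "MASQUERADE"),  # wg -> internet: NAT
    ("198.19.0.0/16",  "ipng",    "MASQUERADE"),  # private -> internet: NAT
    (None,             "ipng-sl", "MASQUERADE"),  # anything else -> private: NAT
]

def postrouting(src: str, out_if: str) -> str:
    """First matching rule wins, like an iptables chain traversal."""
    for prefix, iface, target in RULES:
        if iface != out_if:
            continue
        if prefix is None or ipaddress.ip_address(src) in ipaddress.ip_network(prefix):
            return target
    return "ACCEPT"  # chain policy when nothing matches
```

With this, WireGuard traffic headed into the private network is accepted before the catch-all
masquerade rule can touch it, which is exactly why the rule order matters.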
**3. Ingress Connectivity**
For inbound traffic, the rules are similarly permissive for _trusted_ sources but otherwise prohibit
any passing traffic. Prefixes are allowed to be forwarded from WireGuard, and some (not
disclosed, cuz I'm not stoopid!) trusted prefixes for IPv4 and IPv6, but ultimately if not specified
the forwarding tables will end in a default policy of __DROP__, which means no traffic will be
passed into the WireGuard or Centec internal networks unless explicitly allowed here:
```
iptables -P FORWARD DROP
ip6tables -P FORWARD DROP
for SRC4 in 192.168.6.0/24 ...; do
iptables -I FORWARD -s $SRC4 -j ACCEPT
done
for SRC6 in 2001:678:d78:300::/56 ...; do
ip6tables -I FORWARD -s $SRC6 -j ACCEPT
done
```
With that, any machine in the Centec (and WireGuard) private internal network will have full access
amongst each other, and they will be NATed to the internet, through these three (N+2) gateways. If I
turn one of them off, things look like this:
```
pim@hvn0-ddln0:~$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 msw0.ddln0.net.ipng.ch (198.19.4.129) 0.733 ms 1.040 ms 1.340 ms
2 msw0.chrma0.net.ipng.ch (198.19.2.6) 1.249 ms 1.555 ms 1.799 ms
3 msw0.chbtl0.net.ipng.ch (198.19.2.0) 2.733 ms 2.840 ms 2.974 ms
4 hvn0.chbtl0.net.ipng.ch (198.19.4.2) 1.447 ms 1.423 ms 1.402 ms
5 chbtl0.ipng.ch (194.1.163.66) 1.672 ms 1.652 ms 1.632 ms
6 chrma0.ipng.ch (194.1.163.17) 2.414 ms 2.431 ms 2.322 ms
7 as15169.lup.swissix.ch (91.206.52.223) 2.353 ms 2.331 ms 2.311 ms
...
pim@hvn0-chbtl0:~$ sudo systemctl stop bird
pim@hvn0-ddln0:~$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
1 msw0.ddln0.net.ipng.ch (198.19.4.129) 0.770 ms 1.058 ms 1.311 ms
2 msw0.chrma0.net.ipng.ch (198.19.2.6) 1.251 ms 1.662 ms 2.036 ms
3 msw0.chplo0.net.ipng.ch (198.19.2.22) 5.828 ms 5.455 ms 6.064 ms
4 hvn1.chplo0.net.ipng.ch (198.19.4.163) 4.901 ms 4.879 ms 4.858 ms
5 chplo0.ipng.ch (194.1.163.145) 4.867 ms 4.958 ms 5.113 ms
6 chrma0.ipng.ch (194.1.163.50) 9.274 ms 9.306 ms 9.313 ms
7 as15169.lup.swissix.ch (91.206.52.223) 10.168 ms 10.127 ms 10.090 ms
...
```
{{< image width="200px" float="right" src="/assets/mpls-core/swedish_chef.jpg" alt="Chef's Kiss" >}}
**How cool is that :)** First I do a traceroute from the hypervisor pool in DDLN colocation site, which
finds its closest default at `msw0.chbtl0.net.ipng.ch` which exits via `hvn0.chbtl0` and into the public
internet. Then, I shut down bird on that hypervisor/gateway, which means it won't be advertising the
default into the private network, nor will it be picking up traffic to/from it. About one second
later, the next default route is found to be at `msw0.chplo0.net.ipng.ch` over its hypervisor in
Geneva (note, 4ms down the line), after which the egress is performed at `hvn1.chplo0` into the
public internet. Of course, it's then sent back to Zurich to still find its way to Google at
SwissIX, but the only penalty is a scenic route: looping from Br&uuml;ttisellen to Geneva and back
adds pretty much 8ms of end to end latency.
Just look at that beautiful resilience at play. Chef's kiss.
## What's next
The ring hasn't been fully deployed yet. I am waiting on a backorder of switches from Starry
Networks, due to arrive early April. The delivery of those will allow me to deploy in Paris and Lille,
hopefully in a cool roadtrip with Fred :)
But, I got pretty far, so what's next for me is the following few fun things:
1. Start offering EoMPLS / L2VPN / VPLS services to IPng customers. Who wants some?!
1. Move replication traffic from the current public internet towards the internal _private_ network.
This can both leverage 9000 byte jumboframes, and use wirespeed forwarding from the Centec
network gear.
1. Move all unneeded IPv4 addresses into the _private_ network, such as maintenance and management
/ controlplane, route reflectors, backup servers, hypervisors, and so on.
1. Move frontends to be dual-homed as well: one leg towards AS8298 using Public IPv4 and IPv6
addresses, and then finding backend servers in the private network (think of it like an NGINX
frontend that terminates the HTTP/HTTPS connection [_SSL is inserted and removed here :)_], and then
has one or more backend servers in the private network. This can be useful for Mastodon, Peertube,
and of course our own websites.
---
date: "2023-03-17T10:56:54Z"
title: 'Case Study: Site Local NGINX'
---
A while ago I rolled out an important change to the IPng Networks design: I inserted a bunch of
[[Centec MPLS](https://starry-networks.com)] and IPv4/IPv6 capable switches underneath
[[AS8298]({% post_url 2021-02-27-network %})], which gave me two specific advantages:
1. The entire IPng network is now capable of delivering L2VPN services, taking the form of MPLS
point-to-point ethernet, and VPLS, as shown in a previous [[deep dive]({% post_url
2022-12-09-oem-switch-2 %})], in addition to IPv4 and IPv6 transit provided by VPP in an elaborate
and elegant [[BGP Routing Policy]({% post_url 2021-11-14-routing-policy %})].
1. A new internal private network becomes available to any device connected IPng switches, with
addressing in **198.19.0.0/16** and **2001:678:d78:500::/56**. This network is completely isolated
from the Internet, with access controlled via N+2 redundant gateways/firewalls, described in more
detail in a previous [[deep dive]({% post_url 2023-03-11-mpls-core %})] as well.
## Overview
{{< image width="220px" float="left" src="/assets/ipng-frontends/soad.png" alt="Toxicity" >}}
After rolling out this spiffy BGP Free [[MPLS Core]({% post_url 2023-03-11-mpls-core %})], I wanted
to take a look at maybe conserving a few IP addresses here and there, as well as tightening access
and protecting the more important machines that IPng Networks runs. You see, most enterprise
networks will include a bunch of internal services, like databases, network attached storage, backup
servers, network monitoring, billing/CRM et cetera. IPng Networks is no different.
Somewhere between the sacred silence and sleep, lives my little AS8298. It's a gnarly and toxic
place out there in the DFZ, how do you own disorder?
### Connectivity
{{< image float="right" src="/assets/mpls-core/backbone.svg" alt="Backbone" >}}
As a refresher, here's the current situation at IPng Networks:
**1. Site Local Connectivity**
Each switch gets what is called an IPng Site Local (or _ipng-sl_) interface. This is a /27 IPv4 and
a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links
_between_ sites are no longer switched, they are _routed_ and pass ethernet frames only using MPLS.
I can connect, for example, all of the fleet's hypervisors to this internal network with jumboframes,
using **198.19.0.0/16** and **2001:678:d78:500::/56**, which is not connected to the internet.
**2. Egress Connectivity**
There are three geographically diverse gateways that inject an _OSPF E1_ default route into the
Centec Site Local network, and they will provide NAT for IPv4 and IPv6 to the internet. This setup
allows all machines in the internal private network to reach the internet, using their closest
gateway. Failing over between gateways is fully automatic, when one is unavailable or down for
maintenance, the network will simply find the next-closest gateway.
**3. Ingress Connectivity**
Inbound traffic (from the internet to IPng Site Local) is held at the gateways. First of all, the
reserved IPv4 space **198.18.0.0/15** is a bogon and will not be routed on the public internet, but
our VPP routers in AS8298 do carry the route albeit with the well-known BGP _no-export_ community set,
so traffic could arrive at the gateway coming from our own network only. This is not true for IPv6,
because here our prefix is a part of the AS8298 IPv6 PI space, and traffic will be globally
routable. Even then, only very few prefixes are allowed to enter into the IPng Site Local private
network, nominally only our NOC prefixes, one or two external bastion hosts, and our own Wireguard
endpoints which are running on the gateways.
### Frontend Setup
One of my goals for the private network is IPv4 conservation. I decided to move our web-frontends to
be dual-homed: one network interface towards the internet using public IPv4 and IPv6 addresses, and
another network interface that finds backend servers in the IPng Site Local private network.
This way, I can have one NGINX instance (or a pool of them), terminate the HTTP/HTTPS connection
(there's an InfraSec joke about _SSL is inserted and removed here :)_), no matter how many websites,
domains, or physical webservers I want to use. Some SSL certificate providers allow for wildcards
(ie. `*.ipng.ch`), but I'm going to keep it relatively simple and use [[Let's
Encrypt](https://letsencrypt.org/)] which offers free certificates with a validity of three months.
#### Installing NGINX
First, I will install three _minimal_ VMs with Debian Bullseye on separate hypervisors (in
R&uuml;mlang `chrma0`, Plan-les-Ouates `chplo0` and Amsterdam `nlams1`), giving them each 4 CPUs,
a 16G blockdevice on the hypervisor's ZFS (which is automatically snapshotted and backed up offsite
using ZFS replication!), and 1GB of memory. These machines will be the IPng Frontend servers, and
handle all client traffic to our web properties. Their job is to forward that HTTP/HTTPS traffic
internally to webservers that are running entirely in the IPng Site Local (private) network.
I'll install a few table-stakes packages on them, taking `nginx0.chrma0` as an example:
```
pim@nginx0-chrma0:~$ sudo apt install nginx iptables ufw rsync
pim@nginx0-chrma0:~$ sudo ufw allow 80
pim@nginx0-chrma0:~$ sudo ufw allow 443
pim@nginx0-chrma0:~$ sudo ufw allow from 198.19.0.0/16
pim@nginx0-chrma0:~$ sudo ufw allow from 2001:678:d78:500::/56
pim@nginx0-chrma0:~$ sudo ufw enable
```
#### Installing Lego
Next, I'll install one more highly secured _minimal_ VM with Debian Bullseye, giving it 1 CPU, a 16G
blockdevice and 1GB of memory. This is where my Let's Encrypt SSL certificate store will live. This
machine does not need to be publicly available, so it will only get one interface, connected to the
IPng Site Local network, so it'll be using private IPs.
This virtual machine really is bare-bones: it only gets a firewall, rsync, and the _lego_ package.
It doesn't technically even need to run SSH, because I can log into the serial console using the
hypervisor. But it's an internal-only server (not connected to the internet), and I do believe in
OpenSSH's track record of safety, so I decide to leave SSH enabled:
```
pim@lego:~$ sudo apt install ufw lego rsync
pim@lego:~$ sudo ufw allow 8080
pim@lego:~$ sudo ufw allow 22
pim@lego:~$ sudo ufw enable
```
Now that all four machines are set up and appropriately filtered (using a simple `ufw` Debian package):
* NGINX will allow port 80 and 443 for public facing web traffic, and is permissive for the IPng Site
Local network, to allow SSH for rsync and maintenance tasks
* LEGO will be entirely closed off, allowing access only from trusted sources for SSH, and to one
TCP port 8080 on which `HTTP-01` certificate challenges will be served.
I make a pre-flight check to make sure that jumbo frames are possible from the frontends into the
backend network.
```
pim@nginx0-nlams1:~$ traceroute lego
traceroute to lego (198.19.4.6), 30 hops max, 60 byte packets
1 msw0.nlams0.net.ipng.ch (198.19.4.97) 0.737 ms 0.958 ms 1.155 ms
2 msw0.defra0.net.ipng.ch (198.19.2.22) 6.414 ms 6.748 ms 7.089 ms
3 msw0.chrma0.net.ipng.ch (198.19.2.7) 12.147 ms 12.315 ms 12.401 ms
4 msw0.chbtl0.net.ipng.ch (198.19.2.0) 12.685 ms 12.429 ms 12.557 ms
5 lego.net.ipng.ch (198.19.4.7) 12.916 ms 12.864 ms 12.944 ms
pim@nginx0-nlams1:~$ ping -c 3 -6 -M do -s 8952 lego
PING lego(lego.net.ipng.ch (2001:678:d78:503::6)) 8952 data bytes
8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=1 ttl=62 time=13.33 ms
8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=2 ttl=62 time=13.52 ms
8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=3 ttl=62 time=13.28 ms
--- lego ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 13.280/13.437/13.590/0.116 ms
pim@nginx0-nlams1:~$ ping -c 3 -4 -M do -s 8972 lego
PING (198.19.4.6) 8972(9000) bytes of data.
8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=1 ttl=62 time=12.85 ms
8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=2 ttl=62 time=12.82 ms
8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=3 ttl=62 time=12.91 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 12.823/12.843/12.913/0.138 ms
```
A note on the size used: An IPv4 header is 20 bytes, an IPv6 header is 40 bytes, and an ICMP header
is 8 bytes. If the MTU defined on the network is 9000, then the size of the ping payload can be
9000-20-8=**8972** bytes for IPv4 and 9000-40-8=**8952** for IPv6 packets. Using jumboframes
internally is a small optimization for the benefit of the internal webservers - fewer packets/sec
means more throughput and performance in general. It's also cool :)
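The header arithmetic is easy to get wrong, so here's a tiny illustrative sketch that computes the
maximum ICMP echo payload for a given MTU:

```python
# Maximum ping payload = MTU minus the IP header minus the 8-byte ICMP header.
ICMP_HEADER = 8
IPV4_HEADER = 20   # without IP options
IPV6_HEADER = 40   # fixed header, no extension headers

def max_ping_payload(mtu: int, ipv6: bool = False) -> int:
    return mtu - (IPV6_HEADER if ipv6 else IPV4_HEADER) - ICMP_HEADER
```

For a 9000 byte MTU this yields 8972 (IPv4) and 8952 (IPv6), the values used in the pings above; for
a standard 1500 byte MTU it gives the familiar 1472.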
#### CSRs and ACME, oh my!
In the old days, (and indeed, still today in many cases!) operators would write a Certificate
Signing Request or _CSR_ with the pertinent information for their website, and the SSL authority would
then issue a certificate, send it to the operator via e-mail (or would you believe it, paper mail),
after which the webserver operator could install and use the cert.
Today, most SSL authorities and their customers use the Automatic Certificate Management Environment
or _ACME protocol_ which is described in [[RFC8555](https://www.rfc-editor.org/rfc/rfc8555)]. It
defines a way for certificate authorities to check the websites that they are asked to issue a
certificate for using so-called challenges. There are several challenge types to choose from, but
the one I'll be focusing on is called `HTTP-01`. These challenges are served from a well known
URI, unsurprisingly in the path `/.well-known/...`, as described in [[RFC5785](https://www.rfc-editor.org/rfc/rfc5785)].
{{< image width="200px" float="right" src="/assets/ipng-frontends/certbot.svg" alt="Certbot" >}}
***Certbot***: Usually when running a webserver with SSL enabled, folks will use the excellent
[[Certbot](https://certbot.eff.org/)] tool from the electronic frontier foundation. This tool is
really smart, and has plugins that can automatically take a webserver running common server
software like Apache, Nginx, HAProxy or Plesk, figure out how you configured the webserver (which
hostname, options, etc), request a certificate and rewrite your configuration. What I find a nice
touch is that it automatically installs certificate renewal using a crontab.
{{< image width="200px" float="right" src="/assets/ipng-frontends/lego-logo.min.svg" alt="LEGO" >}}
***LEGO***: A Lets Encrypt client and ACME library written in Go
[[ref](https://go-acme.github.io/lego/)] and it's super powerful, able to solve for multiple ACME
challenges, and tailored to work well with Let's Encrypt as a certificate authority. The `HTTP-01`
challenge works as follows: when an operator wants to prove that they own a given domain name, the
CA can challenge the client to host a mutually agreed upon random number at a random URL under their
webserver's `/.well-known/acme-challenge/` on port 80. The CA will send an HTTP GET to this random
URI and expect the number back in the response.
#### Shared SSL at Edge
Because I will be running multiple frontends in different locations, it's operationally tricky to serve
this `HTTP-01` challenge random number in a randomly named **file** on all three NGINX servers. But
while the LEGO client can write the challenge file directly into a file in the webroot of a server, it
can _also_ run as an HTTP **server** with the sole purpose of responding to the challenge.
{{< image width="500px" float="left" src="/assets/ipng-frontends/acme-flow.svg" alt="ACME Flow" >}}
This is a killer feature: if I point the `/.well-known/acme-challenge/` URI on all the NGINX servers
to the one LEGO instance running centrally, it no longer matters which of the NGINX servers Let's
Encrypt will try to use to solve the challenge - they will all serve the same thing! The LEGO client
will construct the challenge request, ask Let's Encrypt to send the challenge, and then serve the
response. The only thing left to do then is copy the resulting certificate to the frontends.
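To make the mechanics concrete, here's a minimal sketch of an `HTTP-01` style responder in Python —
similar in spirit to LEGO's built-in HTTP server, though the token and key-authorization values are
made up for illustration and this is of course not LEGO's actual code:

```python
import http.server
import threading

# token -> agreed-upon response; in real ACME this maps the challenge token to
# its key authorization. Values here are hypothetical.
CHALLENGES = {"randomtoken123": "keyauth-456"}

class ChallengeHandler(http.server.BaseHTTPRequestHandler):
    PREFIX = "/.well-known/acme-challenge/"

    def do_GET(self):
        token = self.path[len(self.PREFIX):] if self.path.startswith(self.PREFIX) else None
        body = CHALLENGES.get(token)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):  # keep the sketch quiet
        pass

def start_responder(port: int = 0) -> http.server.HTTPServer:
    """Serve challenges on a background thread; port 0 picks a free port."""
    srv = http.server.HTTPServer(("127.0.0.1", port), ChallengeHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

In the real setup, the `/.well-known/acme-challenge/` location on each NGINX frontend proxies to this
one responder, so whichever frontend the CA happens to hit can satisfy the challenge.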
Let me demonstrate how this works, by taking an example based on four websites, none of which run on
servers that are reachable from the internet: [[go.ipng.ch](https://go.ipng.ch/)],
[[video.ipng.ch](https://video.ipng.ch/)], [[billing.ipng.ch](https://billing.ipng.ch/)] and
[[grafana.ipng.ch](https://grafana.ipng.ch/)]. These run on four separate virtual machines (or
docker containers), all within the IPng Site Local private network in **198.19.0.0/16** and
**2001:678:d78:500::/56** which aren't reachable from the internet.
Ready? Let's go!
```
lego@lego:~$ lego --path /etc/lego/ --http --http.port :8080 --email=noc@ipng.ch \
--domains=nginx0.ipng.ch --domains=grafana.ipng.ch --domains=go.ipng.ch \
--domains=video.ipng.ch --domains=billing.ipng.ch \
run
```
The flow of requests is as follows:
1. The _LEGO_ client contacts the Certificate Authority and requests validation for a list of the
cluster hostname `nginx0.ipng.ch` and the additional four domains. It asks the CA to perform an
`HTTP-01` challenge. The CA will share two random numbers with _LEGO_, which will start a
webserver on port 8080 and serve the URI `/.well-known/acme-challenge/$(NUMBER1)`.
1. The CA will now resolve the A/AAAA addresses for the domain (`grafana.ipng.ch`), which is a CNAME
for the cluster (`nginx0.ipng.ch`), which in turn has multiple A/AAAA records pointing to the three
machines associated with it. The CA visits any one of the _NGINX servers_ on the negotiated URI, and
they will forward requests for `/.well-known/acme-challenge` internally back to the machine running LEGO on its port 8080.
1. The _LEGO_ client will know that it's going to be visited on the URI
`/.well-known/acme-challenge/$(NUMBER1)`, as it has negotiated that with the CA in step 1. When the
challenge request arrives, LEGO will know to respond using the contents as agreed upon in
`$(NUMBER2)`.
1. After validating that the response on the random URI contains the agreed upon random number, the
CA knows that the operator of the webserver is the same as the certificate requestor for the domain.
It issues a certificate to the _LEGO_ client, which stores it on its local filesystem.
1. The _LEGO_ machine finally distributes the private key and certificate to all NGINX machines, which
are now capable of serving SSL traffic under the given names.
This sequence is done for each of the domains (and indeed, any other domain I'd like to add), and in
the end a bundled certificate with the common name `nginx0.ipng.ch` and the four additional alternate
names is issued and saved in the certificate store. Up until this point, NGINX has been operating in
**clear text**, that is to say the CA has issued the ACME challenge on port 80, and NGINX has
forwarded it internally to the machine running _LEGO_ on its port 8080 without using encryption.
Taking a look at the certificate that I'll install in the NGINX frontends (note: never share your
`.key` material, but `.crt` files are public knowledge):
```
lego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/nginx0.ipng.ch.crt
...
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
03:db:3d:99:05:f8:c0:92:ec:6b:f6:27:f2:31:55:81:0d:10
Signature Algorithm: sha256WithRSAEncryption
Issuer: C = US, O = Let's Encrypt, CN = R3
Validity
Not Before: Mar 16 19:16:29 2023 GMT
Not After : Jun 14 19:16:28 2023 GMT
Subject: CN = nginx0.ipng.ch
...
X509v3 extensions:
X509v3 Subject Alternative Name:
DNS:billing.ipng.ch, DNS:go.ipng.ch, DNS:grafana.ipng.ch,
DNS:nginx0.ipng.ch, DNS:video.ipng.ch
```
While the amount of output of this certificate is considerable, I've highlighted the cool bits. The
_Subject_ (also called _Common Name_ or _CN_) of the cert is the first `--domains` entry, and the
alternate names are that one plus all other `--domains` given when calling _LEGO_ earlier. In other
words, this certificate is valid for all five DNS domain names. Sweet!
#### NGINX HTTP Configuration
I find it useful to think about the NGINX configuration in two parts: (1) the cleartext / non-ssl
parts on port 80, and (2) the website itself that lives behind SSL on port 443. So in order, here's
my configuration for the acme-challenge bits on port 80:
```
pim@nginx0-chrma0:~$ cat << 'EOF' | tee /etc/nginx/conf.d/lego.inc
location /.well-known/acme-challenge/ {
auth_basic off;
proxy_intercept_errors on;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_pass http://lego.net.ipng.ch:8080;
break;
}
EOF
pim@nginx0-chrma0:~$ cat << 'EOF' | tee /etc/nginx/sites-available/go.ipng.ch.conf
server {
listen [::]:80;
listen 0.0.0.0:80;
server_name go.ipng.ch go.net.ipng.ch go;
access_log /var/log/nginx/go.ipng.ch-access.log;
include "conf.d/lego.inc";
location / {
return 301 https://go.ipng.ch$request_uri;
}
}
EOF
```
The first file is an include-file that is shared between all websites I'll serve from this cluster.
Its purpose is to forward any requests that start with the well-known ACME challenge URI onto the
backend _LEGO_ virtual machine, without requiring any authorization. Then, the second snippet
defines a simple webserver on port 80 giving it a few names (the FQDN `go.ipng.ch` but also two
shorthands `go.net.ipng.ch` and `go`). Due to the include, the ACME challenge will be performed on
port 80. All other requests will be rewritten and returned as a redirect to
`https://go.ipng.ch/`. If you've ever wondered how folks are able to type http://go/foo and still
avoid certificate errors, here's a cool trick that accomplishes that.
Actually these two things are all that's needed to obtain the SSL cert from Let's Encrypt. I haven't
even started a webserver on port 443 yet! To recap:
* Listen ***only to*** `/.well-known/acme-challenge/` on port 80, and forward those requests to LEGO.
* Rewrite ***all other*** port-80 traffic to `https://go.ipng.ch/` to avoid serving any unencrypted
content.
#### NGINX HTTPS Configuration
Now that I have the SSL certificate in hand, I can start to write webserver configs to handle the SSL
parts. I'll include a few common options to make SSL as safe as it can be (borrowed from Certbot),
and then create the configs for the webserver itself:
```
pim@nginx0-chrma0:~$ cat << 'EOF' | tee -a /etc/nginx/conf.d/options-ssl-nginx.inc
ssl_session_cache shared:le_nginx_SSL:10m;
ssl_session_timeout 1440m;
ssl_session_tickets off;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers off;
ssl_ciphers "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:
ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:
DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384";
EOF
pim@nginx0-chrma0:~$ cat << 'EOF' | tee -a /etc/nginx/sites-available/go.ipng.ch.conf
server {
listen [::]:443 ssl http2;
listen 0.0.0.0:443 ssl http2;
ssl_certificate /etc/nginx/conf.d/nginx0.ipng.ch.crt;
ssl_certificate_key /etc/nginx/conf.d/nginx0.ipng.ch.key;
include /etc/nginx/conf.d/options-ssl-nginx.inc;
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.pem;
server_name go.ipng.ch;
access_log /var/log/nginx/go.ipng.ch-access.log upstream;
location /edit/ {
proxy_pass http://git.net.ipng.ch:5000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
satisfy any;
allow 198.19.0.0/16;
allow 194.1.163.0/24;
allow 2001:678:d78::/48;
deny all;
auth_basic "Go Edits";
auth_basic_user_file /etc/nginx/conf.d/go.ipng.ch-htpasswd;
}
location / {
proxy_pass http://git.net.ipng.ch:5000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
EOF
```
The certificate and SSL options are loaded first from `/etc/nginx/conf.d/nginx0.ipng.ch.{crt,key}`.
Next, I don't want folks on the internet to be able to create or edit/overwrite my go-links, so I'll
add an ACL on the URI starting with `/edit/`. Either you come from a trusted IPv4/IPv6 prefix, in
which case you can edit links at will, or alternatively you present a username and password that is
stored in the `go.ipng.ch-htpasswd` file (created with `htpasswd` from the Debian package `apache2-utils`).
Finally, all other traffic is forwarded internally to the machine `git.net.ipng.ch` on port 5000, where
the go-link server is running as a Docker container. That server accepts requests from the IPv4 and IPv6
IPng Site Local addresses of all three NGINX frontends to its port 5000.
### Icing on the cake: Internal SSL
The go-links server I described above doesn't itself speak SSL. It's meant to be frontended on the
same machine by an Apache or NGINX or HAProxy which handles the client en- and decryption, and
usually that frontend will be running on the same server, at which point I could just let it bind
`localhost:5000`. However, the astute observer will point out that the traffic on the IPng Site
Local network is cleartext. Now, I don't think that my go-links traffic poses a security or privacy
threat, but certainly other sites (like `billing.ipng.ch`) are more touchy, and as such require
end-to-end encryption on the network.
In 2003, twenty years ago, a feature was added to TLS that allows the client to specify the hostname
it was expecting to connect to, in a feature called _Server Name Indication_ or _SNI_, described in
detail in [[RFC3546](https://www.rfc-editor.org/rfc/rfc3546)]:
> [TLS] does not provide a mechanism for a client to tell a server the name of the server it is
> contacting. It may be desirable for clients to provide this information to facilitate secure
> connections to servers that host multiple 'virtual' servers at a single underlying network
> address.
>
> In order to provide the server name, clients MAY include an extension of type "server_name" in
> the (extended) client hello.
Every modern webserver and browser can utilize the _SNI_ extension when talking to each other. NGINX can
be configured to pass traffic along to the internal webserver by re-encrypting it with a new SSL
connection. Considering the internal hostname will not necessarily be the same as the external website
hostname, I can use _SNI_ to force the NGINX->Billing connection to re-use the `billing.ipng.ch`
hostname:
```
server_name billing.ipng.ch;
...
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60;
proxy_pass https://billing.net.ipng.ch:443;
proxy_ssl_name $host;
proxy_ssl_server_name on;
}
```
What happens here is the upstream server is hit on port 443 with hostname `billing.net.ipng.ch` but
the SNI value is set back to `$host` which is `billing.ipng.ch` (note, without the \*.net.ipng.ch
domain). The cool thing is, now the internal webserver can reuse the same certificate! I can use the
mechanism described here to obtain the bundled certificate, and then pass that key+cert along to the
billing machine, and serve it there using the same certificate files as the frontend NGINX.
### What's next
Of course, the mission to save IPv4 addresses is achieved - I can now run dozens of websites behind
these three IPv4 and IPv6 addresses, and security gets a little bit better too, as the webservers
themselves are tucked away in IPng Site Local and unreachable from the public internet.
This IPng Frontend design also helps with reliability and latency. I can put frontends in any
number of places, renumber them relatively easily (by adding or removing A/AAAA records to
`nginx0.ipng.ch` and otherwise CNAMEing all my websites to that cluster-name). If load becomes an
issue, NGINX has a bunch of features like caching, cookie-persistence, loadbalancing with health
checking (so I could use multiple backend webservers and round-robin over the healthy ones), and so
on. Our Mastodon server on [[ublog.tech](https://ublog.tech)] or our Peertube server on
[[video.ipng.ch](https://video.ipng.ch/)] can make use of many of these optimizations, but while I
do love engineering, I am also super lazy so I prefer not to prematurely over-optimize.
The main thing that's next is to automate a bit more of this. IPng Networks has an Ansible
controller, to which I'd like to add maintenance of the NGINX and LEGO configuration. That would sort
of look like defining pool `nginx0` with hostnames A, B and C; and then having a playbook that
creates the virtual machine, installs and configures NGINX, and plumbs it through to the _LEGO_
machine. I can imagine running a specific playbook that ensures the certificates stay fresh in some
`CI/CD` (I have a drone runner alongside our [[Gitea](https://git.ipng.ch/)] server), or just add
something clever to a cronjob on the _LEGO_ machine that periodically runs `lego ... renew` and,
when new certificates are issued, copies them out to the NGINX machines in the given cluster with
rsync and reloads their configuration to pick up the new certs.
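Such a cronjob only needs a small helper to decide whether renewal is due. A sketch in Python, using
the `Not Before`/`Not After` timestamps in the OpenSSL text format shown earlier (the 30-day
threshold is my assumption, not something lego prescribes):

```python
import ssl

def days_of_validity(not_before: str, not_after: str) -> int:
    """Whole days between two certificate timestamps like 'Mar 16 19:16:29 2023 GMT'."""
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    return int(end - start) // 86400

def needs_renewal(not_after: str, now: float, threshold_days: int = 30) -> bool:
    """True when fewer than threshold_days of validity remain at time 'now' (epoch seconds)."""
    return ssl.cert_time_to_seconds(not_after) - now < threshold_days * 86400
```

A daily cronjob could run this against the bundle in `/etc/lego/certificates/` and only invoke
`lego ... renew` (and the rsync fan-out) when it returns true.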
But considering Ansible is its whole own elaborate bundle of joy, I'll leave that for maybe another
article.
---
date: "2023-03-24T10:56:54Z"
title: 'Case Study: Let''s Encrypt DNS-01'
---
Last week I shared how IPng Networks deployed a loadbalanced frontend cluster of NGINX webservers
that have public IPv4 / IPv6 addresses, but talk to a bunch of internal webservers that are in a
private network which isn't directly connected to the internet, the so-called _IPng Site Local_
[[ref]({% post_url 2023-03-11-mpls-core %})] with addresses **198.19.0.0/16** and
**2001:678:d78:500::/56**.
I wrote in [[that article]({% post_url 2023-03-17-ipng-frontends %})] that IPng will be using
_ACME_ HTTP-01 validation, which asks the certificate authority, in this case Let's Encrypt, to
contact the webserver on a well-known URI for each domain that I'm requesting a certificate for.
Unsurprisingly, several folks reached out to me asking "well what about DNS-01", and one sentence
caught their eye:
> Some SSL certificate providers allow for wildcards (ie. `*.ipng.ch`), but I'm going to keep it
> relatively simple and use [[Let's Encrypt](https://letsencrypt.org/)] which offers free
> certificates with a validity of three months.
I could've seen this one coming! The sentence can be read to imply it doesn't, but **of course**
Let's Encrypt offers wildcard certificates. It just doesn't satisfy the _relatively simple_ qualifier
in the second part of the sentence ... So here I go, down the rabbit hole that is understanding
(for myself, and possibly for readers of this article), how the DNS-01 challenge works, in greater
detail. Hopefully after writing this (me) and reading this (you), we can all agree that I was
wrong, and that using DNS-01 ***is*** relatively simple after all.
## Overview
I've installed three frontend NGINX servers (running at Coloclue AS8283, IPng AS8298 and IP-Max
AS25091), and one LEGO certificate machine (running in the internal _IPng Site Local_ network).
In the [[previous article]({% post_url 2023-03-17-ipng-frontends %})], I described the setup and
the use of Let's Encrypt with HTTP-01 challenges. I'll skip that here.
#### HTTP-01 vs DNS-01
{{< image width="200px" float="right" src="/assets/ipng-frontends/lego-logo.min.svg" alt="LEGO" >}}
Today, most SSL authorities and their customers use the Automatic Certificate Management Environment
or _ACME protocol_ which is described in [[RFC8555](https://www.rfc-editor.org/rfc/rfc8555)]. It
defines a way for certificate authorities to check the websites that they are asked to issue a
certificate for using so-called challenges. One popular challenge is `HTTP-01`, in
which the certificate authority will visit a well-known URI on the website domain for which the
certificate is being requested, namely `/.well-known/acme-challenge/`, which is described in
[[RFC5785](https://www.rfc-editor.org/rfc/rfc5785)]. The CA will expect the webserver to respond
with an agreed-upon string of numbers at that location, in which case proof of ownership is
established and a certificate is issued.
In some situations, this `HTTP-01` challenge can be difficult to perform:
* If the webserver is not reachable from the internet, or not reachable from the Let's Encrypt
servers, for example if it is on an intranet, such as _IPng Site Local_ itself.
* If the operator would prefer a wildcard certificate, proving ownership of all possible
sub-domains is no longer feasible with `HTTP-01` but proving ownership of the parent domain is.
One possible solution for these cases is to use the ACME challenge `DNS-01`, which doesn't use the
webserver running on `go.ipng.ch` to prove ownership, but the _nameserver_ that serves `ipng.ch`
instead. The Let's Encrypt GO client [[ref](https://go-acme.github.io/lego/)] supports both
challenge types.
The flow of requests in a `DNS-01` challenge is as follows:
{{< image width="400px" float="right" src="/assets/ipng-frontends/acme-flow-dns01.svg" alt="ACME Flow DNS01" >}}
1. First, the _LEGO_ client registers itself with the ACME-DNS server running on `auth.ipng.ch`.
After successful registration, _LEGO_ is given a username, password, and access to one DNS
record name $(RRNAME).
It is expected that the operator sets up a CNAME for a well-known record `_acme-challenge.ipng.ch`
which points to that `$(RRNAME).auth.ipng.ch`. This happens only once.
1. When a certificate is needed, the _LEGO_ client contacts the Certificate Authority and requests
validation for the hostname `go.ipng.ch`. The CA will inform the client of a random
number $(RANDOM) that it expects to see in a well-known TXT record for `_acme-challenge.ipng.ch`
(which is the CNAME set up previously).
1. The _LEGO_ client now uses the username and password it received in step 1, to update the TXT
record of its `$(RRNAME).auth.ipng.ch` record to contain the $(RANDOM) number it learned in step 2.
1. The CA will issue a TXT query for `_acme-challenge.ipng.ch`, which is a CNAME to
`$(RRNAME).auth.ipng.ch`, which ultimately responds to the TXT query with the $(RANDOM) number.
1. After validating that the response on the TXT records contains the agreed upon random number, the
CA knows that the operator of the nameserver is the same as the certificate requestor for the domain.
It issues a certificate to the _LEGO_ client, which stores it on its local filesystem.
1. Similar to any other challenge, the _LEGO_ machine can now distribute the private key and
certificate to all NGINX machines, which are now capable of serving SSL traffic under the given names.
One thing worth noting is that the TXT query is for _domain_ names, not _hostnames_; in other
words, anything in the `ipng.ch` domain will solicit a query to `_acme-challenge.ipng.ch` by the
`DNS-01` challenge. It is for this reason that the challenge allows for wildcard certificates,
which can greatly reduce operational complexity and the total number of certificates needed.
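To make the CNAME indirection concrete, here's a toy sketch of how a validating CA chases the records (the zone data mirrors the example above; `resolve_txt` is a stand-in illustration, not how Let's Encrypt's resolver actually works):

```python
# Toy zone data: the operator's static CNAME plus the record ACME-DNS serves.
ZONE = {
    ("_acme-challenge.ipng.ch", "CNAME"): "76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch",
    ("76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch", "TXT"): "$(RANDOM)",
}

def resolve_txt(name, zone, max_depth=8):
    """Follow CNAMEs until a TXT record turns up, as a validating CA would."""
    for _ in range(max_depth):
        if (name, "TXT") in zone:
            return zone[(name, "TXT")]
        if (name, "CNAME") in zone:
            name = zone[(name, "CNAME")]  # follow the alias one hop
        else:
            return None  # name does not exist
    raise RuntimeError("CNAME chain too long")

print(resolve_txt("_acme-challenge.ipng.ch", ZONE))  # -> the $(RANDOM) token
```

The CA only ever queries the well-known name; the CNAME quietly hands the question off to whatever ACME-DNS is currently serving.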
### ACME DNS
Originally, DNS providers were expected to allow their clients to _directly_ update the
well-known `_acme-challenge` TXT record, and while many commercial providers allow for this, IPng
Networks runs just plain-old [[NSD](https://nlnetlabs.nl/projects/nsd/about)] as authoritative
nameservers (shown above as `nsd0`, `nsd1` and `nsd2`). So what to do? Luckily, the community quickly
understood that if there is a lookup for the TXT record of `_acme-challenge.ipng.ch`,
it would be absolutely OK to make some form of DNS symlink by means of a CNAME.
One really great solution that leverages this ability is written by Joona Hoikkala, called
[[ACME-DNS](https://github.com/joohoi/acme-dns)]. Its sole purpose is to provide an API, served
over https, to register new clients, let those clients update their TXT record(s), and then serve
them out in DNS. It's meant to be a multi-tenant system, by which I mean one ACME-DNS instance can
host millions of domains from thousands of distinct users.
#### Installing
I noticed that ACME-DNS relies on features in relatively modern Go, and the standard version that
comes with Debian Bullseye is a tad old, so first I need to install Go v1.19 from backports, before
I can continue with the build of the binary:
```
lego@lego:~$ sudo apt -t bullseye-backports install golang
lego@lego:~/src$ git clone https://github.com/joohoi/acme-dns
lego@lego:~/src/acme-dns$ export GOPATH=/tmp/acme-dns
lego@lego:~/src/acme-dns$ go build
lego@lego:~/src/acme-dns$ sudo cp acme-dns /usr/local/bin/acme-dns
lego@lego:~/src/acme-dns$ cat << EOF | sudo tee /lib/systemd/system/acme-dns.service
[Unit]
Description=Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and
securely
After=network.target
[Service]
User=lego
Group=lego
AmbientCapabilities=CAP_NET_BIND_SERVICE
WorkingDirectory=~
ExecStart=/usr/local/bin/acme-dns -c /home/lego/acme-dns/config.cfg
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
```
This authoritative nameserver will want to listen on UDP and TCP port 53, for which it either needs to
run as root, or perhaps better, run as a non-privileged user with the `CAP_NET_BIND_SERVICE`
capability. The only other difference from the provided unit file is that I'll be running this as
the `lego` user, with a configuration file and working path in its home directory.
#### Configuring
***Step 1. Delegate auth.ipng.ch***
The first thing I should do is configure the subdomain for ACME-DNS, which I decide will be hosted on
`auth.ipng.ch`. I assign it an NS, an A and a AAAA record, and then update the `ipng.ch` domain:
```
$ORIGIN ipng.ch.
$TTL 86400
@ IN SOA ns.paphosting.net. hostmaster.ipng.ch. ( 2023032401 28800 7200 604800 86400)
NS ns.paphosting.nl.
NS ns.paphosting.net.
NS ns.paphosting.eu.
; ACME DNS
auth NS auth.ipng.ch.
A 194.1.163.93
AAAA 2001:678:d78:3::93
```
This snippet will make a DNS delegation for sub-domain `auth.ipng.ch` to the server also called
`auth.ipng.ch` and because the downstream delegation is in the same domain, I need to provide _glue_
records that tell clients who are querying for `auth.ipng.ch` where to find that nameserver. At
this point, any request for `*.auth.ipng.ch` will end up being forwarded to the authoritative
nameserver, which can be found at either 194.1.163.93 or 2001:678:d78:3::93.
***Step 2. Start ACME DNS***
After having built the acme-dns server and given it a suitable systemd unit file, and knowing that
it's going to be responsible for the sub-domain `auth.ipng.ch`, I give it the following
straightforward configuration file:
```
lego@lego:~$ mkdir ~/acme-dns/
lego@lego:~$ cat << EOF > acme-dns/config.cfg
[general]
listen = "[::]:53"
protocol = "both"
domain = "auth.ipng.ch"
nsname = "auth.ipng.ch"
nsadmin = "hostmaster.ipng.ch"
records = [
"auth.ipng.ch. NS auth.ipng.ch.",
"auth.ipng.ch. A 194.1.163.93",
"auth.ipng.ch. AAAA 2001:678:d78:3::93",
]
debug = false
[database]
engine = "sqlite3"
connection = "/home/lego/acme-dns/acme-dns.db"
[api]
ip = "[::]"
disable_registration = false
port = "443"
tls = "letsencrypt"
acme_cache_dir = "/home/lego/acme-dns/api-certs"
notification_email = "hostmaster+dns-auth@ipng.ch"
corsorigins = [ "*" ]
use_header = false
header_name = "X-Forwarded-For"
[logconfig]
loglevel = "debug"
logtype = "stdout"
logformat = "text"
EOF
lego@lego:~$ sudo systemctl enable acme-dns
lego@lego:~$ sudo systemctl start acme-dns
```
The first part of this tells the server how to construct the SOA record (domain, nsname and
nsadmin), and which records to put in the apex, nominally the NS/A/AAAA records that describe the
nameserver which is authoritative for the `auth.ipng.ch` domain. Then, the database part is where
user credentials will be stored, and the API portion shows how users will be able to interact with
the controlplane part of the service, notably registering new clients, and updating nameserver TXT
records for existing clients.
{{< image width="200px" float="right" src="/assets/ipng-frontends/turtles.png" alt="Turtles" >}}
Interestingly, the API is served on HTTPS port 443, and for that it needs, you guessed it, a
certificate! ACME-DNS eats its own dogfood, which I can appreciate: it will use `DNS-01` validation
to get a certificate for `auth.ipng.ch` _itself_, by serving the challenge for the well-known record
`_acme-challenge.auth.ipng.ch`, so it's turtles all the way down!
***Step 3. Register a new client***
Many public DNS providers allow programmatic updates of zonefile contents, so for them _LEGO_ can
drive the whole process directly. But for me, running NSD, I am going to be using
the ACME-DNS server to fulfill that purpose, so I have to configure it to do that for me.
In the explanation of `DNS-01` challenges above, you'll remember I made a mention of registering. Here's
a closer look at what that means:
```
lego@lego:~$ curl -s -X POST https://auth.ipng.ch/register | json_pp
{
"allowfrom" : [],
"fulldomain" : "76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch",
"password" : "<redacted>",
"subdomain" : "76f88564-740b-4483-9bc0-86d1fb531e20",
"username" : "e4608fdf-9a69-4930-8cf1-57218738792d"
}
```
What happened here is that, using the HTTPS endpoint, I asked the ACME-DNS server to create for me an empty
DNS record, which it did on `76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch`. Further, if I offer
the given username and password, I am able to update that record's value. Let's take a look:
```
lego@lego:~$ dig +short TXT 02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch
lego@lego:~$ curl -s -X POST -H "X-Api-User: 5f3591d1-0d13-4816-a329-7965a8639ab5" \
-H "X-Api-Key: <redacted>" \
-d '{"subdomain": "02e3acfc-bbca-46bb-9cee-8eab52c73c30", \
"txt": "___Hello_World_token_______________________"}' \
https://auth.ipng.ch/update
```
Numbers everywhere, but I learned a lot here! Notice how the first time I sent the `dig` request for
`02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch`, it did not return anything (an empty
record). But then, using the username/password I could update the record with a 41 character
string, and I was informed of the `fulldomain` key there, which is the one that I should be
configuring in the domain(s) for which I want to get a certificate.
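The same `/update` call can be sketched in Python; the credentials below are placeholders shaped like the `/register` response, and a real client would POST the result to `https://auth.ipng.ch/update`:

```python
import json

# Hypothetical credentials, shaped like the /register response shown earlier.
CREDS = {
    "username": "e4608fdf-9a69-4930-8cf1-57218738792d",
    "password": "<redacted>",
    "subdomain": "76f88564-740b-4483-9bc0-86d1fb531e20",
}

def build_update_request(creds, token):
    """Return (headers, body) for ACME-DNS's POST /update endpoint,
    mirroring the curl invocation above."""
    headers = {"X-Api-User": creds["username"], "X-Api-Key": creds["password"]}
    body = json.dumps({"subdomain": creds["subdomain"], "txt": token})
    return headers, body

headers, body = build_update_request(CREDS, "___Hello_World_token_______________________")
```

This is exactly what _LEGO_'s built-in ACME-DNS provider does on my behalf, so I never call the API by hand in practice.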
I configure it in the `ipng.ch` and `ipng.nl` domain as follows (taking `ipng.nl` as an example):
```
$ORIGIN ipng.nl.
$TTL 86400
@ IN SOA ns.paphosting.net. hostmaster.ipng.nl. ( 2023032401 28800 7200 604800 86400)
IN NS ns.paphosting.nl.
IN NS ns.paphosting.net.
IN NS ns.paphosting.eu.
CAA 0 issue "letsencrypt.org"
CAA 0 issuewild "letsencrypt.org"
CAA 0 iodef "mailto:hostmaster@ipng.ch"
_acme-challenge CNAME 8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch.
```
The first records here are of type `CAA`, a DNS record used to provide additional confirmation
for the Certificate Authority when validating an SSL certificate. This record allows me to specify
which certificate authorities are authorized to issue SSL certificates for the domain. Then, the
well-known `_acme-challenge.ipng.nl` record merely tells the client, by means of a `CNAME`, to go
ask for `8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch` instead.
Putting this part all together now, I can issue a query for that ipng.nl domain ...
```
lego@lego:~$ dig +short TXT _acme-challenge.ipng.nl.
"___Hello_World_token_______________________"
```
... and would you look at that! The query for the ipng.nl domain is a CNAME to the specific UUID
record in the auth.ipng.ch domain, where ACME-DNS serves it with the response that I can
programmatically set to different values, yee-haw!
***Step 4. Run LEGO***
The _LEGO_ client has all sorts of challenge providers linked in. Once again, Debian is a bit behind
on things, shipping version 3.2.0-3.1+b5 in Bullseye, although upstream is much further along. So I
purge the Debian package and download the v4.10.2 amd64 package directly from its
[[Github](https://github.com/go-acme/lego/releases)] releases page. The ACME-DNS handler was only
added in v4 of the client. But now all that's left for me to do is run it:
```
lego@lego:~$ export ACME_DNS_API_BASE=https://auth.ipng.ch/
lego@lego:~$ export ACME_DNS_STORAGE_PATH=/home/lego/acme-dns/credentials.json
lego@lego:~$ /home/lego/bin/lego --path /etc/lego/ --email noc@ipng.ch --accept-tos --dns acme-dns \
  --domains ipng.ch --domains '*.ipng.ch' \
  --domains ipng.nl --domains '*.ipng.nl' \
run
```
The LEGO client goes through the ACME flow that I described at the top of this article, and ends up
spitting out a certificate \o/
```
lego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/ipng.ch.crt
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
03:58:8f:c1:25:00:e2:f3:d3:3f:d6:ed:ba:bc:1d:0d:54:ea
Signature Algorithm: sha256WithRSAEncryption
Issuer: C = US, O = Let's Encrypt, CN = R3
Validity
Not Before: Mar 21 20:24:08 2023 GMT
Not After : Jun 19 20:24:07 2023 GMT
Subject: CN = ipng.ch
X509v3 extensions:
X509v3 Subject Alternative Name:
DNS:*.ipng.ch, DNS:*.ipng.nl, DNS:ipng.ch, DNS:ipng.nl
```
Et voila! Wildcard certificates for multiple domains using ACME-DNS.

---
date: "2023-04-09T11:01:14Z"
title: VPP - Monitoring
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
can read all about in my series on VPP back in 2021:
[{{< image width="300px" float="right" src="/assets/vpp-stats/denog14-thumbnail.png" alt="DENOG14" >}}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)
* [[Part 1]({% post_url 2021-08-12-vpp-1 %})]: Punting traffic through TUN/TAP interfaces into Linux
* [[Part 2]({% post_url 2021-08-13-vpp-2 %})]: Mirroring VPP interface configuration into Linux
* [[Part 3]({% post_url 2021-08-15-vpp-3 %})]: Automatically creating sub-interfaces in Linux
* [[Part 4]({% post_url 2021-08-25-vpp-4 %})]: Synchronize link state, MTU and addresses to Linux
* [[Part 5]({% post_url 2021-09-02-vpp-5 %})]: Netlink Listener, synchronizing state from Linux to VPP
* [[Part 6]({% post_url 2021-09-10-vpp-6 %})]: Observability with LibreNMS and VPP SNMP Agent
* [[Part 7]({% post_url 2021-09-21-vpp-7 %})]: Productionizing and reference Supermicro fleet at IPng
With this, I can make a regular server running Linux use VPP as kind of a software ASIC for super
fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links,
addresses and routes) itself. With Linux CP, running software like [FRR](https://frrouting.org/) or
[Bird](https://bird.network.cz/) on top of VPP and achieving &gt;150Mpps and &gt;180Gbps forwarding
rates are easily in reach. If you find that hard to believe, check out [[my DENOG14
talk](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] or click the thumbnail above. I am
continuously surprised at the performance per watt, and the performance per Swiss Franc spent.
## Monitoring VPP
Of course, it's important to be able to see what routers are _doing_ in production. For the longest
time, the _de facto_ standard for monitoring in the networking industry has been Simple Network
Management Protocol (SNMP), described in [[RFC 1157](https://www.rfc-editor.org/rfc/rfc1157)]. But
there's another way, using a metrics and time series system called _Borgmon_, originally designed by
Google [[ref](https://sre.google/sre-book/practical-alerting/)] but popularized by Soundcloud in an
open source interpretation called **Prometheus** [[ref](https://prometheus.io/)]. IPng Networks ♥ Prometheus.
I'm a really huge fan of Prometheus and its graphical frontend Grafana, as you can see with my work on
Mastodon in [[this article]({% post_url 2022-11-27-mastodon-3 %})]. Join me on
[[ublog.tech](https://ublog.tech)] if you haven't joined the Fediverse yet. It's well monitored!
### SNMP
SNMP defines an extensible model by which parts of the OID (object identifier) tree can be delegated
to another process, and the main SNMP daemon will call out to it using an _AgentX_ protocol,
described in [[RFC 2741](https://datatracker.ietf.org/doc/html/rfc2741)]. In a nutshell, this
allows an external program to connect to the main SNMP daemon, register an interest in certain OIDs,
and get called whenever the SNMPd is being queried for them.
{{< image width="400px" float="right" src="/assets/vpp-stats/librenms.png" alt="LibreNMS" >}}
The flow is pretty simple (see section 6.2 of the RFC), the Agent (client):
1. opens a TCP or Unix domain socket to the SNMPd
1. sends an Open PDU, which the server will accept or reject.
1. (optionally) can send a Ping PDU, the server will respond.
1. registers an interest with Register PDU
It then waits and gets called by the SNMPd with Get PDUs (to retrieve one single value), GetNext PDU
(to enable snmpwalk), GetBulk PDU (to retrieve a whole subsection of the MIB), all of which are
answered by a Response PDU.
Using parts of a Python AgentX library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hand at writing one of these AgentX clients.
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
one that has been running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
pretty graphs, as well as detect when interfaces go down, or are overloaded, and so on. That's
pretty slick.
### VPP Stats Segment in Go
But if I may offer some critique on my own approach: SNMP monitoring is _very_ 1990s. I'm
continuously surprised that our industry is still clinging to this archaic approach. VPP offers
_a lot_ of observability, its statistics segment is chock full of interesting counters and gauges
that can be really helpful to understand how the dataplane performs. If there are errors or a
bottleneck develops in the router, going over `show runtime` or `show errors` can be a life saver.
Let's take another look at that Stats Segment (the one that the SNMP AgentX connects to in order to
query it for packets/byte counters and interface names).
You can think of the Stats Segment as a directory hierarchy where each file represents a type of
counter. VPP comes with a small helper tool called VPP Stats FS, which uses a FUSE based read-only
filesystem to expose those counters in an intuitive way, so let's take a look:
```
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo systemctl start vpp
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make start
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ mount | grep stats
rawBridge on /run/vpp/stats_fs_dir type fuse.rawBridge (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
pim@hippo:/run/vpp/stats_fs_dir$ ls -la
drwxr-xr-x 0 root root 0 Apr 9 14:07 bfd
drwxr-xr-x 0 root root 0 Apr 9 14:07 buffer-pools
drwxr-xr-x 0 root root 0 Apr 9 14:07 err
drwxr-xr-x 0 root root 0 Apr 9 14:07 if
drwxr-xr-x 0 root root 0 Apr 9 14:07 interfaces
drwxr-xr-x 0 root root 0 Apr 9 14:07 mem
drwxr-xr-x 0 root root 0 Apr 9 14:07 net
drwxr-xr-x 0 root root 0 Apr 9 14:07 node
drwxr-xr-x 0 root root 0 Apr 9 14:07 nodes
drwxr-xr-x 0 root root 0 Apr 9 14:07 sys
pim@hippo:/run/vpp/stats_fs_dir$ cat sys/boottime
1681042046.00
pim@hippo:/run/vpp/stats_fs_dir$ date +%s
1681042058
pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make stop
```
There's lots of really interesting stuff in here - for example in the `/sys` hierarchy we can see a
`boottime` file, and from there I can determine the uptime of the process. Further, the `/mem`
hierarchy shows the current memory usage for each of the _main_, _api_ and _stats_ segment heaps.
And of course, in the `/interfaces` hierarchy we can see all the usual packets and bytes counters
for any interface created in the dataplane.
### VPP Stats Segment in C
I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:
1. assemble a vector of regular expression patterns in the hierarchy, or just `^/` to start
1. get a handle to the stats segment with `stat_segment_ls()` using the pattern(s)
1. use the handle to dump the stats segment into a vector with `stat_segment_dump()`.
1. iterate over the returned stats structure, each element has a type and a given name:
* ***STAT_DIR_TYPE_SCALAR_INDEX***: these are floating point doubles
  * ***STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE***: a vector of single uint64 counters
  * ***STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED***: a vector of paired uint64 counters (packets and bytes)
1. freeing the used stats structure with `stat_segment_data_free()`
The simple and combined stats turn out to be associative arrays, the outer of which notes the
_thread_ and the inner of which refers to the _index_. As such, a statistic of type
***VECTOR_SIMPLE*** can be decoded like so:
```
if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE)
for (k = 0; k < vec_len (res[i].simple_counter_vec); k++)
for (j = 0; j < vec_len (res[i].simple_counter_vec[k]); j++)
printf ("[%d @ %d]: %llu packets %s\n", j, k, res[i].simple_counter_vec[k][j], res[i].name);
```
The statistic of type ***VECTOR_COMBINED*** is very similar, except the union type there is a
`combined_counter_vec[k][j]` which has a member `.packets` and a member called `.bytes`. The
simplest form, ***SCALAR_INDEX***, is just a single floating point number attached to the name.
In principle, this should be really easy to sift through and decode. Now that I've figured that
out, let me dump a bunch of stats with the `vpp_get_stats` tool that comes with vanilla VPP:
```
pim@chrma0:~$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep -v ': 0'
[0 @ 2]: 67057 packets /interfaces/TenGigabitEthernet81_0_0.40121/drops
[0 @ 2]: 76125287 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip4
[0 @ 2]: 1793946 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip6
[0 @ 2]: 77919629 packets, 66184628769 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 0]: 7 packets, 610 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 1]: 26687 packets, 18771919 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 2]: 6448944 packets, 3663975508 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 3]: 138924 packets, 20599785 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
[0 @ 4]: 130720342 packets, 57436383614 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx
```
I can see both types of counter at play here. Let me explain the first line: it is saying that the
counter named `/interfaces/TenGigabitEthernet81_0_0.40121/drops`, at counter index 0, CPU thread
2, has a simple counter with value 67057. Taking the last line, this is a combined counter type with
name `/interfaces/TenGigabitEthernet81_0_0.40121/tx` at index 0; all five CPU threads (the main
thread and four worker threads) have sent traffic into this interface, and the counters for each,
in packets and bytes, are given.
For readability's sake, my `grep -v` above doesn't print any counter that is 0. For example,
interface `Te81/0/0` has only one receive queue, and it's bound to thread 2. The other threads will
not receive any packets for it, consequently their `rx` counters stay zero:
```
pim@chrma0:~/src/vpp$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep rx$
[0 @ 0]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 1]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 2]: 80720186 packets, 68458816253 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 3]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
[0 @ 4]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx
```
### Hierarchy: Pattern Matching
I quickly discover a pattern in most of these names: they start with a scope, say `/interfaces`,
then have a path entry for the interface name, and finally a specific counter (`/rx` or `/mpls`).
This is true also for the `/nodes` hierarchy, in which all VPP's graph nodes have a set of counters:
```
pim@chrma0:~$ vpp_get_stats dump /nodes/ip4-lookup | grep -v ': 0'
[0 @ 1]: 11365675493301 packets /nodes/ip4-lookup/clocks
[0 @ 2]: 3256664129799 packets /nodes/ip4-lookup/clocks
[0 @ 3]: 28364098623954 packets /nodes/ip4-lookup/clocks
[0 @ 4]: 30198798628761 packets /nodes/ip4-lookup/clocks
[0 @ 1]: 80870763789 packets /nodes/ip4-lookup/vectors
[0 @ 2]: 17392446654 packets /nodes/ip4-lookup/vectors
[0 @ 3]: 259363625369 packets /nodes/ip4-lookup/vectors
[0 @ 4]: 298176625181 packets /nodes/ip4-lookup/vectors
[0 @ 1]: 49730112811 packets /nodes/ip4-lookup/calls
[0 @ 2]: 13035172295 packets /nodes/ip4-lookup/calls
[0 @ 3]: 109088424231 packets /nodes/ip4-lookup/calls
[0 @ 4]: 119789874274 packets /nodes/ip4-lookup/calls
```
If you've ever seen the output of `show runtime`, it looks like this:
```
vpp# show runtime
Thread 1 vpp_wk_0 (lcore 28)
Time 3377500.2, 10 sec internal node vector rate 1.46 loops/sec 3301017.05
vector rates in 2.7440e6, out 2.7210e6, drop 3.6025e1, punt 7.2243e-5
Name State Calls Vectors Suspends Clocks Vectors/Call
...
ip4-lookup active 49732141978 80873724903 0 1.41e2 1.63
```
Hey look! On thread 1, which is called `vpp_wk_0` and is running on logical CPU core #28, there are
a bunch of VPP graph nodes that are all keeping stats of what they've been doing, and you can see
here that the following numbers line up between `show runtime` and the VPP Stats dumper:
* ***Name***: This is the name of the VPP graph node, in this case `ip4-lookup`, which performs an
  IPv4 FIB lookup to figure out the L3 nexthop of a given IPv4 packet we're trying to route.
* ***Calls***: How often did we invoke this graph node, 49.7 billion times so far.
* ***Vectors***: How many packets did we push through, 80.87 billion, humble brag.
* ***Clocks***: This one is a bit different -- you can see the cumulative clock cycles spent by
this CPU thread in the stats dump: 11365675493301 divided by 80870763789 packets is 140.54 CPU
  cycles per packet. It's a cool interview question: "How many CPU cycles does it take to do an
  IPv4 routing table lookup?" You now know the answer :-)
* ***Vectors/Call***: This is a measure of how busy the node is (did it run for only one packet,
or for many packets?). On average when the worker thread gave the `ip4-lookup` node some work to
do, there have been a total of 80873724903 packets handled in 49732141978 calls, so 1.626
packets per call. If ever you're handling 256 packets per call (the most VPP will allow per call),
your router will be sobbing.
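The ***Clocks*** and ***Vectors/Call*** arithmetic above, written out explicitly (values copied from the thread 1 stats dump; `show runtime` shows marginally newer counters, hence the tiny difference in the third decimal):

```python
# Counters for the ip4-lookup node on thread 1, copied from the dumps above.
clocks = 11365675493301   # /nodes/ip4-lookup/clocks
vectors = 80870763789     # /nodes/ip4-lookup/vectors
calls = 49732141978       # /nodes/ip4-lookup/calls

clocks_per_packet = clocks / vectors  # average cost of one IPv4 FIB lookup
vectors_per_call = vectors / calls    # how "busy" the node runs per invocation

print(f"{clocks_per_packet:.2f} cycles/packet")  # 140.54
print(f"{vectors_per_call:.2f} vectors/call")    # 1.63
```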
### Prometheus Metrics
Prometheus has metrics which carry a name, and zero or more labels. The prometheus query language
can then use these labels to do aggregation, division, averages, and so on. As a practical example,
above I looked at interface stats and saw that the Rx/Tx numbers were counted one per thread. If
we'd like the total on the interface, it would be great if we could `sum without (thread,index)`,
which will have the effect of adding all of these numbers together.
For the monotonically increasing counter numbers (like the total vectors/calls/clocks per node), we
can take the running _rate_ of change, showing the time spent over the last minute, or so. This way,
spikes in traffic will clearly correlate both with a spike in packets/sec or bytes/sec on the
interface, but also a higher number of _vectors/call_, and correspondingly typically a lower number
of _clocks/vector_, as VPP gets more efficient when it can re-use the CPU's instruction and data
cache to do repeat work on multiple packets.
I decide to massage the statistic names a little bit, by transforming them into the basic format:
`prefix_suffix{label="X",index="A",thread="B"} value`
A few examples:
* The single counter that looks like `[6 @ 0]: 994403888 packets /mem/main heap` becomes:
* `mem{heap="main heap",index="6",thread="0"}`
* The combined counter `[0 @ 1]: 79582338270 packets, 16265349667188 bytes /interfaces/Te1_0_2/rx`
becomes:
* `interfaces_rx_packets{interface="Te1_0_2",index="0",thread="1"} 79582338270`
* `interfaces_rx_bytes{interface="Te1_0_2",index="0",thread="1"} 16265349667188`
* The node information running on, say thread 4, becomes:
* `nodes_clocks{node="ip4-lookup",index="0",thread="4"} 30198798628761`
* `nodes_vectors{node="ip4-lookup",index="0",thread="4"} 298176625181`
* `nodes_calls{node="ip4-lookup",index="0",thread="4"} 119789874274`
* `nodes_suspends{node="ip4-lookup",index="0",thread="4"} 0`
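A quick sketch of that transformation in Python (the exporter itself is written in C, as described below; `metric_lines` and the `LABEL_KEYS` mapping are illustrative names of my own):

```python
# Map a stats-segment scope to the prometheus label key it should carry.
LABEL_KEYS = {"interfaces": "interface", "nodes": "node", "mem": "heap"}

def metric_lines(path, index, thread, values):
    """Translate one stats-segment entry into prometheus exposition lines.

    `values` maps a unit suffix to its counter value: {"": n} for a simple
    counter, {"packets": n, "bytes": m} for a combined one."""
    parts = path.strip("/").split("/")
    prefix, label_value = parts[0], parts[1]
    # Three path components mean there is a trailing counter name (/rx, /clocks);
    # two components (like /mem/main heap) mean the scope itself is the metric.
    name = f"{prefix}_{parts[2]}" if len(parts) == 3 else prefix
    label = f'{LABEL_KEYS[prefix]}="{label_value}",'
    lines = []
    for unit, value in values.items():
        metric = f"{name}_{unit}" if unit else name
        lines.append(f'{metric}{{{label}index="{index}",thread="{thread}"}} {value}')
    return lines
```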
### VPP Exporter
I wish I had things like `split()` and `re.match()` in C (well, I guess I do have POSIX regular
expressions...), but it's all a little bit more low-level. Based on my basic loop that opens the
stats segment, registers its desired patterns, and then retrieves a vector of {name, type,
counter}-tuples, I decide to do a little bit of non-intrusive string tokenization first:
```
static int tokenize (const char *str, char delimiter, char **tokens, int *lengths) {
  char *p = (char *) str;
  char *savep = p;
  int i = 0;
  while (*p)
    if (*p == delimiter) {
      tokens[i] = (char *) savep;     /* token runs from savep ... */
      lengths[i] = (int) (p - savep); /* ... up to (not including) the delimiter */
      i++; p++; savep = p;
    } else p++;
  tokens[i] = (char *) savep; /* the trailing token after the last delimiter */
  lengths[i] = (int) (p - savep);
  return i + 1; /* the number of tokens found */
}
/* The call site */
char *tokens[10];
int lengths[10];
int num_tokens = tokenize (res[i].name, '/', tokens, lengths);
```
The tokenizer takes an array of N pointers to the resulting tokens, and their lengths. This sets it
aside from `strtok()` and friends, because those will overwrite the occurrences of the delimiter in
the input string with `\0`, and as such cannot take a `const char *str` as input. This one leaves
the string alone though, and will return the tokens as {ptr, len}-tuples, including how many
tokens it found.
One thing I'll probably regret is that there's no bounds checking on the number of tokens -- if I
have more than 10 of these, I'll come to regret it. But for now, the depth of the hierarchy is only
3, so I should be fine. Besides, I got into a fight with ChatGPT after it declared a romantic
interest in my cat, so it won't write code for me anymore :-(
But using this simple tokenizer, and knowledge of the structure of well known hierarchy paths, the
rest of the exporter is quickly in hand. Some variables don't have a label (for example
`/sys/boottime`), but those that do will see that field transposed from the directory path
`/mem/main heap/free` into the label as I showed above.
### Results
{{< image width="400px" float="right" src="/assets/vpp-stats/grafana1.png" alt="Grafana 1" >}}
With this VPP Prometheus Exporter, I can now hook the VPP routers up to Prometheus and Grafana.
Aggregations in Grafana are super easy and scalable, due to the conversion of the static paths into
dynamically created labels on the Prometheus metric names.
Drawing a graph of the running time spent by each individual VPP graph node might look something
like this:
```
sum without (node)(rate(nodes_clocks[60s]))
/
sum without (node)(rate(nodes_vectors[60s]))
```
The plot to the right shows a system under a loadtest that ramps up from 0% to 100% of line rate,
and the traces are the cumulative time spent in each node (on a logarithmic scale). The top purple
line represents `dpdk-input`. When a VPP dataplane is idle, the worker threads will be repeatedly
polling DPDK to ask it if it has something to do, spending 100% of their time being told "there is
nothing for you to do". But, once load starts appearing, the other nodes start spending CPU time,
for example the chain of IPv4 forwarding is `ethernet-input`, `ip4-input`, `ip4-lookup`, followed by
`ip4-rewrite` and ultimately the packet is transmitted on some other interface. When the system is
lightly loaded, the `ethernet-input` node for example will spend 1100 or so CPU cycles per packet,
but when the machine is under higher load, the time spent will decrease to as low as 22 CPU cycles
per packet. This is true for almost all of the nodes - VPP gets relatively _more efficient_ under
load.
{{< image width="400px" float="right" src="/assets/vpp-stats/grafana2.png" alt="Grafana 2" >}}
Another cool graph, one that I won't be able to see when using only LibreNMS and SNMP polling, is how
busy the router is. In VPP, each dispatch of the worker loop will poll DPDK and dispatch the packets
through the directed graph of nodes that I showed above. But how many packets can be moved through
the graph per CPU? The largest number of packets that VPP will ever offer into a call of the nodes
is 256. Typically an unloaded machine will have an average number of Vectors/Call of around 1.00.
When the worker thread is loaded, it may sit at around 130-150 Vectors/Call. If it's saturated, it
will quickly shoot up to 256.
As a good approximation, Vectors/Call normalized to 100% will be an indication of how busy the
dataplane is. In the picture above, between 10:30 and 11:00 my test router was pushing about 180Gbps
of traffic, but with large packets so its total vectors/call was modest (roughly 35-40), which you
can see as all threads there are running in the ~25% load range. Then at 11:00 a few threads got
hotter, and one of them completely saturated, and the traffic being forwarded by the CPU thread was
suffering _packetlo_, even though the others were absolutely fine... forwarding 150Mpps on a 10 year
old Dell R720!
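Expressed in PromQL with the metric names from my exporter, this per-thread load estimate could look roughly like the following (a sketch only -- the exact aggregation labels depend on how the dashboard slices the data):

```
100 * (
    sum without (node) (rate(nodes_vectors[60s]))
  /
    sum without (node) (rate(nodes_calls[60s]))
) / 256
```

That is, the average vectors/call per thread over the last minute, normalized against the maximum vector size of 256.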
### What's Next
Together with the graph above, I can also see how many CPU cycles are spent in which
type of operation. For example, encapsulation of GENEVE or VxLAN is not _free_, although it's also
not very expensive. If I know how many CPU cycles are available (roughly the clock speed of the CPU
threads, in our case Xeon X1518 (2.2GHz) or Xeon E5-2683 v4 (3GHz) CPUs), I can pretty accurately
calculate what a given mix of traffic and features is going to cost, and how many packets/sec our
routers at IPng will be able to forward. Spoiler alert: it's way more than currently needed. Our
Supermicros can handle roughly 35Mpps each, and considering a regular mixture of internet traffic
(called _imix_) is about 3Mpps per 10G, I will have room to spare for the time being.
This is super valuable information for folks running VPP in production.
I haven't put the finishing touches on the VPP Prometheus Exporter yet: for example, there are no
commandline flags, and it only listens on port 9482 (the same one that the toy
exporter in `src/vpp/app/vpp_prometheus_export.c` ships with
[[ref](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)]). My Grafana
dashboard is also not fully completed yet. I hope to get that done in April, and publish both the
exporter and the dashboard on GitHub. Stay tuned!
---
date: "2023-05-07T10:01:14Z"
title: VPP MPLS - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({% post_url 2023-03-11-mpls-core %})].
Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go down this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec
switches I deployed perform VxLAN, GENEVE and GRE all at line rate in silicon. And in an earlier
article, I showed that the performance of VPP in these tunneling protocols is actually pretty good.
Take a look at my [[VPP L2 article]({% post_url 2022-01-12-vpp-l2 %})] for context.
You might ask yourself: _Then why bother?_ To which I would respond: if you have to ask that question,
clearly you don't know me :) This article will form a deep dive into MPLS as implemented by VPP. In
a later set of articles, I'll partner with the incomparable [@vifino](https://chaos.social/@vifino) who
is adding MPLS support to the Linux Controlplane plugin. After that, I do expect VPP to be able to act
as a fully fledged provider- and provider-edge MPLS router.
## Lab Setup
A while ago I created a [[VPP Lab]({% post_url 2022-10-14-lab-1 %})] which is pretty slick, I use it
all the time. Most of the time I find myself messing around on the hypervisor and adding namespaces
with interfaces in it, to pair up with the VPP interfaces. And I tcpdump a lot! It's time for me to
make an upgrade to the Lab -- take a look at this picture:
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
There's quite a bit to unpack here, but it will be useful to know this layout as I'll be referring
to the components here throughout the rest of the article. Each **lab** now has seven virtual
machines:
1. **vppX-Y** are Debian Testing machines running a reasonably fresh VPP - they are daisychained
with the first one attaching to the headend called `lab.ipng.ch`, using its Gi10/0/0 interface, and
onwards to its eastbound neighbor vpp0-1 using its Gi10/0/1 interface.
1. **hostX-Y** are two Debian machines which have their 4 network cards (enp16s0fX) connected each
to one VPP instance's Gi10/0/2 interface (for `host0-0`) or Gi10/0/3 (for `host0-1`). This way, I
can test all sorts of topologies with one router, two routers, or multiple routers.
1. **tapX-0** is a special virtual machine which receives a copy of every packet on the underlying
Open vSwitch network fabric.
***NOTE***: X is the 0-based lab number, and Y stands for the 0-based logical machine number, so `vpp1-3`
is the fourth VPP virtual machine on the second lab.
### Detour 1: Open vSwitch
To explain this tap a little bit - let me first talk about the underlay. All seven of these machines
(and each their four network cards) are bound by the hypervisor into an Open vSwitch bridge called
`vpplan`. Then, I use two features to build this topology:
Firstly, each pair of interfaces will be added as an access port into individual VLANs. For
example, `vpp0-0.Gi10/0/1` connects with `vpp0-1.Gi10/0/0` in VLAN 20 (annotated in orange), and
`vpp0-0.Gi10/0/2` connects to `host0-0.enp16s0f0` in VLAN 30 (annotated in purple). You can see the
East-West traffic over the VPP backbone is in the 20s, the `host0-0` traffic northbound is in the
30s, and the `host0-1` traffic southbound is in the 40s. Finally, the whole Open vSwitch fabric is
connected to `lab.ipng.ch` using VLAN 10 and a physical network card on the hypervisor (annotated in
green). The `lab.ipng.ch` machine then has internet connectivity.
```
BR=vpplan
for p in $(ovs-vsctl list-ifaces $BR); do
ovs-vsctl set port $p vlan_mode=access
done
# Uplink (Green)
ovs-vsctl set port uplink tag=10 ## eno1.200 on the Hypervisor
ovs-vsctl set port vpp0-0-0 tag=10
# Backbone (Orange)
ovs-vsctl set port vpp0-0-1 tag=20
ovs-vsctl set port vpp0-1-0 tag=20
...
# Northbound (Purple)
ovs-vsctl set port vpp0-0-2 tag=30
ovs-vsctl set port host0-0-0 tag=30
...
# Southbound (Red)
...
ovs-vsctl set port vpp0-3-3 tag=43
ovs-vsctl set port host0-1-3 tag=43
```
**NOTE**: The KVM interface names are of the form `vppX-Y-Z`, where X is the lab number (0 in this case --
IPng does have multiple labs so I can run experiments and lab environments independently and isolated),
Y is the machine number, and Z is the interface number on the machine (from [0..3]).
### Detour 2: Mirroring Traffic
Secondly, now that I have created a 29 port switch with 12 VLANs, I decide to create an OVS _mirror
port_, which can be used to make a copy of traffic going in- or out of (a list of) ports. This is a
super powerful feature, and it looks like this:
```
BR=vpplan
MIRROR=mirror-rx
ovs-vsctl set port tap0-0-0 vlan_mode=access
ovs-vsctl list mirror $MIRROR >/dev/null 2>&1 && \
  ovs-vsctl --id=@m get mirror $MIRROR -- remove bridge $BR mirrors @m
ovs-vsctl --id=@m create mirror name=$MIRROR \
-- --id=@p get port tap0-0-0 \
-- add bridge $BR mirrors @m \
-- set mirror $MIRROR output-port=@p \
-- set mirror $MIRROR select_dst_port=[] \
-- set mirror $MIRROR select_src_port=[]
for iface in $(ovs-vsctl list-ports $BR); do
[[ $iface == tap* ]] && continue
ovs-vsctl add mirror $MIRROR select_dst_port $(ovs-vsctl get port $iface _uuid)
done
```
The first call sets up the OVS switchport called `tap0-0-0` (which is enp16s0f0 on the machine
`tap0-0`) as an access port. To allow for this script to be idempotent, the second line will look up
if the mirror exists and if so, delete it. Then, I (re)create a mirror port with a given name
(`mirror-rx`), add it to the bridge, make the mirror's output port become `tap0-0-0`, and finally
clear the selected source and destination ports (this is where the traffic is mirrored _from_). At
this point, I have an empty mirror. To give it something useful to do, I loop over all of the ports
in the `vpplan` bridge and add them to the mirror, if they are the _destination_ port (here I have
to specify the uuid of the interface, not its name). I will add all interfaces, except those
of the `tap0-0` machine itself, to avoid loops.
In the end, I create two of these, one called `mirror-rx` which is forwarded to `tap0-0-0`
(enp16s0f0) and the other called `mirror-tx` which is forwarded to `tap0-0-1` (enp16s0f1). I can use
tcpdump on either of these ports, to show all the traffic either going _ingress_ to any port on any
machine, or emitting _egress_ from any port on any machine, respectively.
## Preparing the LAB
I wrote a little bit about the automation I use to maintain a few reproducible lab environments in a
[[previous article]({% post_url 2022-10-14-lab-1 %})], so I'll only show the commands themselves here,
not the underlying systems. When the LAB boots up, it comes with a basic Linux CP configuration that
uses OSPF and OSPFv3 running in Bird2, to connect the `vpp0-0` through `vpp0-3` machines together (each
router's Gi10/0/0 port connects to the next router's Gi10/0/1 port). LAB0 is in use by
[@vifino](https://chaos.social/@vifino) at the moment, so I'll take the next one running on its own
hypervisor, called LAB1.
Each machine has an IPv4 and IPv6 loopback, so the LAB will come up with basic connectivity:
```
pim@lab:~/src/lab$ LAB=1 ./create
pim@lab:~/src/lab$ LAB=1 ./command pristine
pim@lab:~/src/lab$ LAB=1 ./command start && sleep 150
pim@lab:~/src/lab$ traceroute6 vpp1-3.lab.ipng.ch
traceroute to vpp1-3.lab.ipng.ch (2001:678:d78:211::3), 30 hops max, 24 byte packets
1 e0.vpp1-0.lab.ipng.ch (2001:678:d78:211::fffe) 2.0363 ms 2.0123 ms 2.0138 ms
2 e0.vpp1-1.lab.ipng.ch (2001:678:d78:211::1:11) 3.0969 ms 3.1261 ms 3.3413 ms
3 e0.vpp1-2.lab.ipng.ch (2001:678:d78:211::2:12) 6.4845 ms 6.3981 ms 6.5409 ms
4 vpp1-3.lab.ipng.ch (2001:678:d78:211::3) 7.4610 ms 7.5698 ms 7.6413 ms
```
## MPLS For Dummies
.. like me! MPLS stands for [[Multi Protocol Label
Switching](https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching)]. Rather than looking at the
IPv4 or IPv6 header in the packet, and making the routing decision based on the destination address,
MPLS takes the whole packet and encapsulates it into a new datagram that carries a 20-bit number
(called the _label_), three bits to classify the traffic, one _S-bit_ to signal that this is the
last label in a stack of labels, and finally 8 bits of TTL.
In total, 32 bits are prepended to the whole IP packet, or Ethernet frame, or any other type of inner
datagram. This is why it's called _Multi Protocol_. The _S-bit_ allows routers to know if the
following data is the inner payload (S=1), or if the following 32 bits are another MPLS label (S=0).
This way, routers can add more than one label into a _label stack_.
Forwarding decisions are made on the contents of this MPLS _label_, without the need to examine
the packet itself. Two significant benefits become obvious:
1. The inner data payload (ie. an IPv6 packet or an Ethernet frame) doesn't have to be
rewritten, no new checksum created, no TTL decremented. Any datagram can be stuffed into an MPLS
packet, the routing (or _packet switching_) entirely happens using only the MPLS headers.
2. Importantly, no source- or destination IP addresses have to be looked up in a possibly very
large (~1M entry) FIB to figure out the next hop. Rather than traversing a [[Radix
Trie](https://en.wikipedia.org/wiki/Radix_tree)] or other datastructure to find the next-hop, a
static [[Hash Table](https://en.wikipedia.org/wiki/Hash_table)] with literal integer MPLS labels
can be consulted. This greatly simplifies the computational complexity in transit.
***P-Router***: The simplest form of an MPLS router is a so-called Label-Switch-Router (_LSR_) which is synonymous
with Provider-Router (_P-Router_). This is the router that sits in the core of the network, and its
only purpose is to receive MPLS packets, look up what to do with them based on the _label_ value,
and then forward the packet onto the next router. Sometimes the router can (and will) rewrite the
label, in an operation called a SWAP, but it can also leave the label as it was (in other words, the
input label value can be the same as the outgoing label value). The logic kind of goes like
**MPLS In-Label** => **{ MPLS Out-Label, Out-Interface, Out-NextHop }**. It's this behavior
that explains the name _Label Switching_.
If you were to imagine plotting a path through the lab network from say `vpp1-0` on the one side,
through `vpp1-1` and `vpp1-2`, and finally onwards to `vpp1-3`, each router would be receiving MPLS packets on one
interface, and emitting them on their way to the next router on another interface. That *path* of
*switching* operations on the *labels* of those MPLS packets thus forms a so-called _Label-Switched-Path
(LSP)_. These LSPs are fundamental building blocks of MPLS networks, as I'll demonstrate later.
***PE-Router***: Some routers have a less boring job to do - those that sit at the edge of an MPLS network, accept
customer traffic and do something useful with it. These are called Label-Edge-Router (_LER_) which
is often colloquially called a Provider-Edge-Router (_PE-Router_). These routers receive normal
packets (ethernet or IP or otherwise), and perform the encapsulation by adding MPLS labels to them
upon receipt (ingress, called PUSH), or removing the encapsulation (called POP) and finding the
inner payload, continuing to handle them as per normal. The logic for these can be much more
complicated, but you can imagine it goes something like **MPLS In-Label** => **{ Operation }**
where the operation may be "take the resulting datagram, assume it is an IPv4 packet, so look it up
in the IPv4 routing table" or "take the resulting datagram, assume it is an ethernet frame, and emit
it on a specific interface", and really any number of other "operations".
The cool thing about MPLS is that the type of operations are vendor-extensible. If two routers A and B
agree what label 1234 means _to them_, they can simply insert it at the top of the _labelstack_ say
{100,1234}, where the bottom one (the 100 label that all the _P-Routers_ see) carries the semantic
meaning of "switch this packet onto the destination _PE-router_", where that _PE-router_ can pop the
outer label, to reveal the 1234-label, which it can look up in its table to tell it what to do next
with the MPLS payload in any way it chooses - the _P-Routers_ don't have to understand the meaning
of label 1234, they don't have to use or inspect it at all!
### Step 0: End Host setup
{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}
For this lab, I'm going to boot up instance LAB1 with no changes (for posterity, using image
`vpp-proto-disk0@20230403-release`). As an aside, IPng Networks has several of these lab
environments, and while [@vifino](https://chaos.social/@vifino) is doing some development testing on
LAB0, I simply switch to LAB1 to let him work in peace.
With the MPLS concepts introduced, let me start by configuring `host1-0` and `host1-1` and giving them
an IPv4 loopback address, and a transit network to their routers `vpp1-3` and `vpp1-0` respectively:
```
root@host1-1:~# ip link set enp16s0f0 up mtu 1500
root@host1-1:~# ip addr add 192.0.2.2/31 dev enp16s0f0
root@host1-1:~# ip addr add 10.0.1.1/32 dev lo
root@host1-1:~# ip ro add 10.0.1.0/32 via 192.0.2.3
root@host1-0:~# ip link set enp16s0f3 up mtu 1500
root@host1-0:~# ip addr add 192.0.2.0/31 dev enp16s0f3
root@host1-0:~# ip addr add 10.0.1.0/32 dev lo
root@host1-0:~# ip ro add 10.0.1.1/32 via 192.0.2.1
root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
```
At this point, I don't expect to see much, as I haven't configured VPP yet. But `host1-0` will start
ARPing for 192.0.2.1 on `enp16s0f3`, which is connected to `vpp1-3.e2`. Let me take a look on the
Open vSwitch mirror to confirm that:
```
root@tap1-0:~# tcpdump -vni enp16s0f0 vlan 33
12:41:27.174052 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:28.333901 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:29.517415 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:30.645418 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
```
Alright! I'm going to leave the ping running in the background, and I'll trace packets through the
network using the Open vSwitch mirror, as well as take a look at what VPP is doing with the packets
using its packet tracer.
### Step 1: PE Ingress
```
vpp1-3# set interface state GigabitEthernet10/0/2 up
vpp1-3# set interface ip address GigabitEthernet10/0/2 192.0.2.1/31
vpp1-3# mpls table add 0
vpp1-3# set interface mpls GigabitEthernet10/0/1 enable
vpp1-3# set interface mpls GigabitEthernet10/0/0 enable
vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100
```
Now the ARP resolution succeeds, and I can see that `host1-0` starts sending ICMP packets towards the
loopback that I have configured on `host1-1`, and it's of course using the newly learned L2 adjacency
for 192.0.2.1 at 52:54:00:13:10:02 (which is `vpp1-3.e2`). But, take a look at what the VPP router
does next: due to the `ip route add ...` command, I've told it to reach 10.0.1.1 via a nexthop of
`vpp1-2.e1`, but it will PUSH a single MPLS label 100,S=1 and forward it out on its Gi10/0/0
interface:
```
root@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls
12:45:56.551896 52:54:00:20:10:03 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33
p 0, ethertype ARP (0x0806), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:45:56.553311 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 46: vlan 33
p 0, ethertype ARP (0x0806), Reply 192.0.2.1 is-at 52:54:00:13:10:02, length 28
12:45:56.620924 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64
12:45:56.621473 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64
```
My MPLS journey on VPP has officially begun! The first exchange in the tcpdump (packets 1 and 2)
is the ARP resolution of 192.0.2.1 by `host1-0`, after which it knows where to send the ICMP echo
(packet 3, on VLAN33), which is then sent out by `vpp1-3` as MPLS to `vpp1-2` (packet 4, on VLAN22).
Let me show you what such a packet looks like from the point of view of VPP. It has a _packet
tracing_ function which shows how any individual packet traverses the graph of nodes through the
router. It's a lot of information, but for a VPP operator, let alone a developer, it's a really
important skill to learn -- so off I go, capturing and tracing a handful of packets:
```
vpp1-3# trace add dpdk-input 10
vpp1-3# show trace
------------------- Start of thread 0 vpp_main -------------------
Packet 1
20:15:00:496109: dpdk-input
GigabitEthernet10/0/2 rx queue 0
buffer 0x4c44df: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 2, nb_segs 1, pkt_len 98
buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x2ed13840
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02
ICMP: 10.0.1.0 -> 10.0.1.1
tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
fragment id 0x2706, flags DONT_FRAGMENT
ICMP echo_request checksum 0x3bd6 id 8399
20:15:00:496167: ethernet-input
frame: flags 0x1, hw-if-index 3, sw-if-index 3
IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02
20:15:00:496201: ip4-input
ICMP: 10.0.1.0 -> 10.0.1.1
tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
fragment id 0x2706, flags DONT_FRAGMENT
ICMP echo_request checksum 0x3bd6 id 8399
20:15:00:496225: ip4-lookup
fib 0 dpo-idx 1 flow hash: 0x00000000
ICMP: 10.0.1.0 -> 10.0.1.1
tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
fragment id 0x2706, flags DONT_FRAGMENT
ICMP echo_request checksum 0x3bd6 id 8399
20:15:00:496256: ip4-mpls-label-imposition-pipe
mpls-header:[100:64:0:eos]
20:15:00:496258: mpls-output
adj-idx 25 : mpls via 192.168.11.10 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001210015254001310008847 flow hash: 0x00000000
20:15:00:496260: GigabitEthernet10/0/0-output
GigabitEthernet10/0/0 flags 0x0018000d
MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
label 100 exp 0, s 1, ttl 64
20:15:00:496262: GigabitEthernet10/0/0-tx
GigabitEthernet10/0/0 tx queue 0
buffer 0x4c44df: current data -4, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 2, nb_segs 1, pkt_len 102
buf_len 2176, data_len 102, ol_flags 0x0, data_off 124, phys_addr 0x2ed13840
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
label 100 exp 0, s 1, ttl 64
```
This packet has gone through a total of eight nodes, and the local timestamps are the uptime of VPP
when the packets were received. I'll try to explain them in turn:
1. ***dpdk-input***: The packet is initially received from Gi10/0/2 receive queue 0. It was an
ethernet packet from 52:54:00:20:10:03 (`host1-0.enp16s0f3`) to 52:54:00:13:10:02 (`vpp1-3.e2`). Some
more information is gleaned here, notably that it was an ethernet frame, an L3 IPv4 and L4 ICMP
packet.
1. ***ethernet-input***: Since it was an ethernet frame, it was passed into this node. Here VPP
concludes that this is an IPv4 packet, because the ethertype is 0x0800.
1. ***ip4-input***: We know it's an IPv4 packet, and the layer4 information shows this is an ICMP
echo packet from 10.0.1.0 to 10.0.1.1 (configured on `host1-1.lo`). VPP now needs to figure out where
to route this packet.
1. ***ip4-lookup***: VPP takes a look at its FIB for 10.0.1.1 - note the information I specified
above (the `ip route add ...` on `vpp1-3`) - the next-hop here is 192.168.11.10 on Gi10/0/0 _but_ VPP
also sees that I'm intending to add an MPLS _out-label_ of 100.
1. ***ip4-mpls-label-imposition-pipe***: An MPLS packet header is prepended in front of the IPv4
packet, which will have only one label (100) and since it's the only label, it will set the S-bit
(end-of-stack) to 1, and the MPLS TTL initializes at 64.
1. ***mpls-output***: Now that the IPv4 packet is wrapped into an MPLS packet, VPP uses the rest
of the FIB entry (notably the next-hop 192.168.11.10 and the output interface Gi10/0/0) to find where
this thing is supposed to go.
1. ***Gi10/0/0-output***: VPP now prepares the packet to be sent out on Gi10/0/0 as an MPLS
ethernet type. It uses the L2FIB adjacency table to figure out that we'll be sending it from our mac
address 52:54:00:13:10:00 (`vpp1-3.e0`) to the next hop on 52:54:00:12:10:01 (`vpp1-2.e1`).
1. ***Gi10/0/0-tx***: VPP hands the fully formed packet with all necessary information back to
DPDK to marshall it on the wire.
Can you imagine this router can do such a thing at a rate of 18-20 Million packets per second,
linearly scaling up per added CPU thread? I learn something new every time I look at a packet trace,
I simply love this dataplane implementation!
### Step 2: P-routers
In Step 1 I've shown that `vpp1-3` did send the MPLS packet to `vpp1-2`, but I haven't configured
anything there yet, and because I didn't enable MPLS, each of these beautiful packets is brutally
sent to the bit-bucket (also called _dpo-drop_):
```
vpp1-2# show err
Count Node Reason Severity
132 mpls-input MPLS input packets decapsulated info
132 mpls-input MPLS not enabled error
```
The purpose of a _P-router_ is to switch labels along the _Label-Switched-Path_. So let's manually
create the following to tell this `vpp1-2` router what to do when it receives an MPLS frame with
label 100:
```
vpp1-2# mpls table add 0
vpp1-2# set interface mpls GigabitEthernet10/0/0 enable
vpp1-2# set interface mpls GigabitEthernet10/0/1 enable
vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100
```
Remember, above I explained that the _P-Router_ has a simple job? It really does! All I'm doing here
is telling VPP that if it receives an MPLS packet with label 100 on any MPLS-enabled interface
(notably Gi10/0/1, on which it is currently receiving MPLS packets from `vpp1-3`), it should send the
MPLS packet out on Gi10/0/0 to neighbor 192.168.11.8, swapping the label to (the same, in this case) 100.
If I've done a good job, I should be able to see this packet traversing the P-Router in a packet
trace:
```
20:42:51:151144: dpdk-input
GigabitEthernet10/0/1 rx queue 0
buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 1, nb_segs 1, pkt_len 102
buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
label 100 exp 0, s 1, ttl 64
20:42:51:151161: ethernet-input
frame: flags 0x1, hw-if-index 2, sw-if-index 2
MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
20:42:51:151171: mpls-input
MPLS: next mpls-lookup[1] label 100 ttl 64 exp 0
20:42:51:151174: mpls-lookup
MPLS: next [6], lookup fib index 0, LB index 74 hash 0 label 100 eos 1
20:42:51:151177: mpls-label-imposition-pipe
mpls-header:[100:63:0:eos]
20:42:51:151179: mpls-output
adj-idx 28 : mpls via 192.168.11.8 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001110015254001210008847 flow hash: 0x00000000
20:42:51:151181: GigabitEthernet10/0/0-output
GigabitEthernet10/0/0 flags 0x0018000d
MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
label 100 exp 0, s 1, ttl 63
20:42:51:151184: GigabitEthernet10/0/0-tx
GigabitEthernet10/0/0 tx queue 0
buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 1, nb_segs 1, pkt_len 102
buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
label 100 exp 0, s 1, ttl 63
```
In order, the following nodes are traversed:
1. ***dpdk-input***: received the frame from the network interface Gi10/0/1
1. ***ethernet-input***: the frame was an ethernet frame, and VPP determines based on the
ethertype (0x8847) that it is an MPLS frame
1. ***mpls-input***: inspects the MPLS labelstack and sees the outermost label (the only one on
this frame) with a value of 100.
1. ***mpls-lookup***: looks up in the MPLS FIB what to do with packets which are End-Of-Stack or
_EOS_ (ie. with the S-bit set to 1), and are labeled 100. At this point VPP could make a different
choice if there is 1 label (as in this case), or a stack of multiple labels (Not-End-of-Stack or
_NEOS_, ie. with the S-bit set to 0).
1. ***mpls-label-imposition-pipe***: reads from the FIB that the outer label needs to be SWAPped
to a new _out-label_ (also with value 100). Because it's the same label, this is a no-op. However,
since this router is forwarding the MPLS packet, it will decrement the TTL to 63.
1. ***mpls-output***: VPP then uses the rest of the FIB information to determine the L3 nexthop is
192.168.11.8 on Gi10/0/0.
1. ***Gi10/0/0-output***: uses the L2FIB adjacency table to determine that the L2 nexthop is MAC
address 52:54:00:11:10:01 (`vpp1-1.e1`). If there is no L2 adjacency, this would be a good time for
VPP to send an ARP request to resolve the IP-to-MAC and store it in the L2FIB.
1. ***Gi10/0/0-tx***: hands off the frame to DPDK for marshalling on the wire.
If you counted along with me, you'll see that this flow in the _P-Router_ also has eight nodes. However,
while the IPv4 FIB can and will grow north of one million entries in a longest-prefix match radix trie
(which is computationally expensive), the MPLS FIB contains far fewer entries, organized for a
literal key lookup in a hash table. And unlike in IPv4 routing, the packet being transported does
not get a decremented IPv4 TTL, which would require a recalculated IPv4 header checksum. MPLS
switching is _much_ cheaper than IPv4 routing!
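To make that per-hop cost concrete, here is a small sketch (function names are mine, not VPP's) of the checksum work IPv4 forwarding pays on every TTL decrement, which an MPLS label swap avoids entirely. The incremental form is the RFC 1141-style update routers typically use.

```python
# Sketch: why IPv4 forwarding pays more per hop than an MPLS swap. Every
# IPv4 TTL decrement forces a header checksum update; MPLS carries its TTL
# in the label entry, which has no checksum at all.

def ip4_checksum(header: bytes) -> int:
    """Full IPv4 header checksum, with the checksum field zeroed in input."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                      # fold carries (one's complement)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def decrement_ttl_checksum(checksum: int) -> int:
    """Incremental checksum update for a TTL decrement (RFC 1141 style).

    TTL is the high byte of the 16-bit word it shares with the protocol
    field, so TTL-1 lowers the one's-complement sum by 0x0100, raising
    its complement by 0x0100 (with end-around carry).
    """
    total = checksum + 0x0100
    return (total & 0xFFFF) + (total >> 16)
```

With the header fields from the ICMP trace in this lab (ttl 62, id 0x3dec, length 84, 10.0.1.0 > 10.0.1.1), the full computation yields checksum 0xe8bc, matching the `ip4-rewrite` trace output.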
Now that our packets are switched from `vpp1-2` to `vpp1-1` (which is also a _P-Router_), I'll just
rinse and repeat there, using the L3 adjacency pointing at `vpp1-0.e1` (192.168.11.6 on Gi10/0/0):
```
vpp1-1# mpls table add 0
vpp1-1# set interface mpls GigabitEthernet10/0/0 enable
vpp1-1# set interface mpls GigabitEthernet10/0/1 enable
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
```
Did I do this correctly? One way to check is by taking a look at which packets are seen on the Open
vSwitch mirror ports:
```
root@tap1-0:~# tcpdump -eni enp16s0f0
13:42:47.724107 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
13:42:47.724769 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
13:42:47.725038 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
13:42:47.726155 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
```
Nice!! I confirm that the ICMP packet first travels over VLAN 33 (from `host1-0` to `vpp1-3`), and then
the MPLS packets travel from `vpp1-3`, through `vpp1-2` and `vpp1-1`, towards `vpp1-0` over VLANs
22, 21 and 20 respectively.
### Step 3: PE Egress
Seeing as I haven't done anything with `vpp1-0` yet, now the MPLS packets all get dropped there. But not
for much longer, as I'm now ready to tell `vpp1-0` what to do with those packets:
```
vpp1-0# mpls table add 0
vpp1-0# set interface mpls GigabitEthernet10/0/0 enable
vpp1-0# set interface mpls GigabitEthernet10/0/1 enable
vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2
```
The difference between the _P-Routers_ in Step 2 and this _PE-Router_, is the operation provided in
the MPLS FIB. When an MPLS packet with _label_ value 100 is received, instead of forwarding it into
another interface (which is what the _P-Router_ would do), I tell VPP here to unwrap the MPLS label,
and expect to find an IPv4 packet which I'm asking it to route by looking up an IPv4 next hop in the
(IPv4) FIB table 0.
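The contrast between the two router roles can be sketched as a toy exact-match FIB. All names here are illustrative, not VPP's internal structures:

```python
# Toy model of the MPLS FIB difference described above: the same label
# value maps to a SWAP operation on a P-Router, but to a POP followed by
# an IPv4 lookup on a PE-Router. A plain dict mirrors the exact-match
# (hash) lookup that MPLS uses.

P_FIB = {100: ("swap", 100, "GigabitEthernet10/0/0")}   # forward as MPLS
PE_FIB = {100: ("pop-ip4-lookup", 0)}                   # decapsulate into FIB table 0

def mpls_operation(fib: dict, label: int) -> str:
    """Describe what a router does with an incoming EOS label."""
    entry = fib[label]
    if entry[0] == "swap":
        _, out_label, iface = entry
        return f"swap to label {out_label}, out {iface}"
    if entry[0] == "pop-ip4-lookup":
        return f"pop, then IPv4 lookup in table {entry[1]}"
    raise ValueError(f"unknown operation {entry[0]!r}")
```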
All that's left for me to do is add a regular static route for 10.0.1.1/32 via 192.0.2.2 (which is
the address on interface `host1-1.enp16s0f3`). If my thinking cap is still working, I should now see
packets emitted from `vpp1-0` on Gi10/0/3:
```
vpp1-0# trace add dpdk-input 10
vpp1-0# show trace
21:34:39:370589: dpdk-input
GigabitEthernet10/0/1 rx queue 0
buffer 0x4c4a34: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 1, nb_segs 1, pkt_len 102
buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x2ff28d80
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
label 100 exp 0, s 1, ttl 62
21:34:39:370672: ethernet-input
frame: flags 0x1, hw-if-index 2, sw-if-index 2
MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
21:34:39:370702: mpls-input
MPLS: next mpls-lookup[1] label 100 ttl 62 exp 0
21:34:39:370704: mpls-lookup
MPLS: next [6], lookup fib index 0, LB index 83 hash 0 label 100 eos 1
21:34:39:370706: ip4-mpls-label-disposition-pipe
rpf-id:-1 ip4, pipe
21:34:39:370708: lookup-ip4-dst
fib-index:0 addr:10.0.1.1 load-balance:82
21:34:39:370710: ip4-rewrite
tx_sw_if_index 4 dpo-idx 32 : ipv4 via 192.0.2.2 GigabitEthernet10/0/3: mtu:9000 next:9 flags:[] 5254002110005254001010030800 flow hash: 0x00000000
00000000: 5254002110005254001010030800450000543dec40003e01e8bc0a0001000a00
00000020: 01010800173d231c01a0fce65864000000009ce80b00000000001011
21:34:39:370735: GigabitEthernet10/0/3-output
GigabitEthernet10/0/3 flags 0x0418000d
IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
ICMP: 10.0.1.0 -> 10.0.1.1
tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
fragment id 0x3dec, flags DONT_FRAGMENT
ICMP echo_request checksum 0x173d id 8988
21:34:39:370739: GigabitEthernet10/0/3-tx
GigabitEthernet10/0/3 tx queue 0
buffer 0x4c4a34: current data 4, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1
PKT MBUF: port 1, nb_segs 1, pkt_len 98
buf_len 2176, data_len 98, ol_flags 0x0, data_off 132, phys_addr 0x2ff28d80
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
ICMP: 10.0.1.0 -> 10.0.1.1
tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
fragment id 0x3dec, flags DONT_FRAGMENT
ICMP echo_request checksum 0x173d id 8988
```
Alright, another one of those huge blobs of information about a single packet traversing the VPP
dataplane, but it's the last one for this article, I promise! In order:
1. ***dpdk-input***: DPDK reads the frame which is arriving from `vpp1-1` on Gi10/0/1, it determines
that this is an ethernet frame
1. ***ethernet-input***: Based on the ethertype 0x8847, it knows that this ethernet frame is an
MPLS packet
1. ***mpls-input***: The MPLS _labelstack_ has one label, value 100, with (obviously) the
EndOfStack _S-bit_ set to 1; I can also see the (MPLS) TTL is 62 here, because it has traversed three
routers (`vpp1-3` TTL=64, `vpp1-2` TTL=63, and `vpp1-1` TTL=62)
1. ***mpls-lookup***: The lookup of local _label_ 100 informs VPP that it should switch to IPv4
processing and handle the packet as such
1. ***ip4-mpls-label-disposition-pipe***: The MPLS label is removed, revealing an IPv4 packet as
the inner payload of the MPLS datagram
1. ***lookup-ip4-dst***: VPP can now do a regular IPv4 forwarding table lookup for 10.0.1.1 which
informs it that it should forward the packet via 192.0.2.2 which is directly connected to Gi10/0/3.
1. ***ip4-rewrite***: The IPv4 TTL is decremented and the IP header checksum recomputed
1. ***Gi10/0/3-output***: VPP now can look up the L2FIB adjacency belonging to 192.0.2.2 on Gi10/0/3,
which informs it that 52:54:00:21:10:00 is the ethernet nexthop
1. ***Gi10/0/3-tx***: The packet is now handed off to DPDK to marshall on the wire, destined to
`host1-1.enp16s0f3`
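Each of the MPLS nodes above prints one 32-bit label stack entry (`label 100 exp 0, s 1, ttl 62`). The RFC 3032 field layout — label (20 bits), EXP/TC (3 bits), S (1 bit), TTL (8 bits) — can be sketched as:

```python
# Encode/decode the 4-byte MPLS label stack entry printed by the traces
# above. Layout per RFC 3032: label:20 | exp:3 | S:1 | ttl:8.

def encode_mpls_entry(label: int, exp: int, bos: bool, ttl: int) -> int:
    """Pack the four fields into one 32-bit label stack entry."""
    return (label << 12) | (exp << 9) | (int(bos) << 8) | ttl

def decode_mpls_entry(word: int) -> tuple[int, int, bool, int]:
    """Unpack a 32-bit entry into (label, exp, bottom-of-stack, ttl)."""
    return (word >> 12) & 0xFFFFF, (word >> 9) & 0x7, bool(word & 0x100), word & 0xFF
```

The entry arriving at `vpp1-0` in this trace (label 100, exp 0, S=1, ttl 62) encodes to 0x0006413e.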
That means I should be able to see it on `host1-1`, right? If you, too, are dying to know, check this out:
```
root@host1-1:~# tcpdump -ni enp16s0f0 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:53.776486 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1249, length 64
14:25:53.776522 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1249, length 64
14:25:54.799829 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1250, length 64
14:25:54.799866 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1250, length 64
```
"Jiggle jiggle, wiggle wiggle!", as I do a premature congratulatory dance on the chair in my lab! I
created a _label-switched-path_ using VPP as MPLS provider-edge and provider routers, to move
this ICMP echo packet all the way from `host1-0` to `host1-1`, but there's absolutely nothing to
suggest that the resulting ICMP echo-reply can go back from `host1-1` to `host1-0`, because
_LSPs_ are unidirectional. The final step for me to do is create an _LSP_ back in the other
direction:
```
vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0
```
And with that, the ping I started at the beginning of this article, shoots to life:
```
root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
PING 10.0.1.1 (10.0.1.1) from 10.0.1.0 : 56(84) bytes of data.
64 bytes from 10.0.1.1: icmp_seq=7644 ttl=62 time=6.28 ms
64 bytes from 10.0.1.1: icmp_seq=7645 ttl=62 time=7.45 ms
64 bytes from 10.0.1.1: icmp_seq=7646 ttl=62 time=7.01 ms
64 bytes from 10.0.1.1: icmp_seq=7647 ttl=62 time=5.76 ms
64 bytes from 10.0.1.1: icmp_seq=7648 ttl=62 time=5.88 ms
64 bytes from 10.0.1.1: icmp_seq=7649 ttl=62 time=9.23 ms
```
I will leave you with this packetdump from the Open vSwitch mirror, showing the entire flow of one
ICMP packet through the network:
```
root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
10 packets captured
10 packets received by filter
```
You can see all of the attributes I demonstrated in this article in one go: ingress ICMP packet on
VLAN 33, encapsulation with label 100, S=1 and ttl decrementing as the MPLS packet traverses
eastwards through the string of VPP routers on VLANs 22, 21 and 20, ultimately being sent out on
VLAN 40. There, the ICMP echo request packet is responded to, and we can trace the ICMP response as
it makes its way back westwards through the MPLS network using label 103, ultimately re-appearing on
VLAN 33.
There you have it. This is a fun story of _Multi Protocol Label Switching (MPLS)_ bringing a packet from
a _Label-Edge-Router (LER)_ through several _Label-Switch-Routers (LSRs)_ over a statically
configured _Label-Switched-Path (LSP)_. I feel like I can now more confidently use these terms
without sounding silly.
## What's next
The first mission is accomplished. I've taken a good look at forwarding IPv4 traffic through the VPP
dataplane as MPLS packets, encapsulating and decapsulating the traffic on _PE-Routers_ and forwarding
the traffic using intermediary _P-Routers_. MPLS switching is cheaper than IPv4/IPv6 routing, but it can
also open up a bunch of possibilities for advanced service offerings, such as my coveted _Martini
Tunnels_ which transport ethernet frames point-to-point over an MPLS backbone. That will be the topic
of an upcoming article, in which I will join forces with [@vifino](https://chaos.social/@vifino) who is adding
Linux Controlplane functionality to program the MPLS FIB using Netlink -- such that things like 'ip'
and 'FRR' can discover and share label information using a Label Distribution Protocol or _LDP_.
---
date: "2023-05-17T10:01:14Z"
title: VPP MPLS - Part 2
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({% post_url 2023-03-11-mpls-core %})].
In the last article, I explored VPP's MPLS implementation a little bit. All the while,
[@vifino](https://chaos.social/@vifino) has been tinkering with the Linux Control Plane and adding
MPLS support to it, and together we learned a lot about how VPP does MPLS forwarding and how it
sometimes differs from other implementations. During the process, we talked a bit about
_implicit-null_ and _explicit-null_. When my buddy Fred read the [[previous article]({% post_url
2023-05-07-vpp-mpls-1 %})], he also talked about a feature called _penultimate-hop-popping_ which
maybe deserves a bit more explanation. At the same time, I could not help but wonder what the
performance is of VPP as a _P-Router_ and _PE-Router_, compared to say IPv4 forwarding.
## Lab Setup: VMs
{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}
For this article, I'm going to boot up instance LAB1 with no changes (for posterity, using image
`vpp-proto-disk0@20230403-release`), and it will be in the same state it was at the end of my
previous [[MPLS article]({% post_url 2023-05-07-vpp-mpls-1 %})]. To recap, there are four routers
daisychained in a string, and they are called `vpp1-0` through `vpp1-3`. I've then connected a
Debian virtual machine on both sides of the string. `host1-0.enp16s0f3` connects to `vpp1-3.e2`
and `host1-1.enp16s0f0` connects to `vpp1-0.e3`. Finally, recall that all of the links between these
routers and hosts can be inspected with the machine `tap1-0` which is connected to a mirror port on
the underlying Open vSwitch fabric. I bound some RFC1918 addresses on `host1-0` and `host1-1` and
can ping between the machines, using the VPP routers as MPLS transport.
### MPLS: Simple LSP
In this mode, I can plumb two _label switched paths (LSPs)_, the first one westbound from `vpp1-3`
to `vpp1-0`, and it wraps the packet destined to 10.0.1.1 into an MPLS packet with a single label
100:
```
vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100
vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2
```
The second is eastbound from `vpp1-0` to `vpp1-3`, and it is using MPLS label 103. Remember:
LSPs are unidirectional!
```
vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0
```
With these two _LSPs_ established, the ICMP echo request and subsequent ICMP echo reply can be seen
traveling through the network entirely as MPLS:
```
root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
10 packets captured
10 packets received by filter
```
When `vpp1-0` receives the MPLS frame with label 100,S=1, it looks up in the MPLS FIB what
_operation_ to perform with this packet: POP the label, revealing the inner payload, which it
must then look up in the IPv4 FIB and forward as per normal. This is a bit more expensive than it
could be, and the folks who established the MPLS protocols found a few clever ways to cut down on the cost!
### MPLS: Wellknown Label Values
I didn't know this until I started tinkering with MPLS on VPP, and as an operator it's easy to
overlook these things. As it so turns out, there are a few MPLS label values that have a very specific
meaning. Taking a read on [[RFC3032](https://www.rfc-editor.org/rfc/rfc3032.html)], label values 0-15
are reserved and they each serve a specific purpose:
* ***Value 0***: IPv4 Explicit NULL Label
* ***Value 1***: Router Alert Label
* ***Value 2***: IPv6 Explicit NULL Label
* ***Value 3***: Implicit NULL Label
There are a few other label values, 4-15, and if you're curious you could take a look at the [[IANA
list](https://www.iana.org/assignments/mpls-label-values/mpls-label-values.xhtml)] for them. For my
purposes, though, I'm only going to look at these weird little _NULL_ labels. What do they do?
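The four values discussed above can be captured in a small lookup table. This only names the labels this article uses; the IANA registry assigns meanings to several of the other reserved values too.

```python
# The reserved MPLS label values called out above (RFC 3032). Only the
# four discussed in this article are named here; 4-15 are also reserved
# but their assignments (see the IANA registry) are omitted.

RESERVED_LABELS = {
    0: "IPv4 Explicit NULL",
    1: "Router Alert",
    2: "IPv6 Explicit NULL",
    3: "Implicit NULL",
}

def classify_label(label: int) -> str:
    """Name a label if it is one of the reserved values we care about."""
    if label < 16:
        return RESERVED_LABELS.get(label, "reserved (not named here)")
    return "ordinary label"
```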
### MPLS: Explicit Null
RFC3032 discusses the IPv4 explicit NULL label, value 0 (and the IPv6 variant with value 2):
> This label value is only legal at the bottom of the label
> stack. It indicates that the label stack must be popped,
> and the forwarding of the packet must then be based on the
> IPv4 header.
What this means in practice is that we can allow MPLS _PE-Routers_ to take a little shortcut. If the
MPLS label in the last hop is just telling the router to POP the label and take a look in its IPv4
forwarding table, I can also set the label to 0 in the router just preceding it. This way, when the
last router sees label value 0, it knows already what to do, saving it one FIB lookup.
I can reconfigure both LSPs to make use of this feature, by changing the MPLS FIB entries on
`vpp1-1` that points the _LSP_ towards `vpp1-0`, removing what I configured before (`mpls
local-label del ...`) and replacing that with an out-label value of 0 (`mpls local-label add ...`):
```
vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0
vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0
```
Due to this, the last routers in the _LSP_ now already know what to do, so I can clean these up:
```
vpp1-0# mpls local-label del 100 eos via ip4-lookup-in-table 0
vpp1-3# mpls local-label del 103 eos via ip4-lookup-in-table 0
```
If I ping from `host1-0` to `host1-1` again, I can see a subtle but important difference in the
packets on the wire:
```
17:49:23.770119 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.770403 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.771184 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.772503 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.773392 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.773602 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.774592 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.775804 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.776973 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.778255 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
```
Did you spot it? :) If your eyes are spinning, don't worry! I have configured the routers `vpp1-1`
towards `vpp1-0` in vlan 20 to use _IPv4 Explicit NULL_ (label 0). You can spot it on the fourth
packet in the tcpdump above. On the way back, `vpp1-2` towards `vpp1-3` in vlan 22 also sets _IPv4
Explicit NULL_ for the echo-reply. But, I do notice that end to end, the packet is still traversing
the network entirely as MPLS packets. The optimization here is that `vpp1-0` knows that label value
0 at the end of the label-stack just means 'what follows is an IPv4 packet, route it'.
### MPLS: Implicit Null
Did that really help that much? I think I can answer the question by loadtesting, but first let me
take a closer look at what RFC3032 has to say about the _Implicit NULL Label_:
> A value of 3 represents the "Implicit NULL Label". This
> is a label that an LSR may assign and distribute, but
> which never actually appears in the encapsulation. When
> an LSR would otherwise replace the label at the top of the
> stack with a new label, but the new label is "Implicit
> NULL", the LSR will pop the stack instead of doing the
> replacement. Although this value may never appear in the
> encapsulation, it needs to be specified in the Label
> Distribution Protocol, so a value is reserved.
{{< image width="200px" float="right" src="/assets/vpp-mpls/PHP-logo.svg" alt="PHP Logo" >}}
Oh, groovy! What this tells me is that I can take one further shortcut. With label value 0
(_IPv4 Explicit NULL_) or 2 (_IPv6 Explicit NULL_), the last router in the chain already knows to do
the IPv4 (or IPv6) FIB lookup, saving it one MPLS FIB lookup. But label value 3
(_Implicit NULL_) tells the penultimate router to just unwrap the MPLS parts (it's looking at them
anyway!) and forward the bare inner payload, an IPv4 or IPv6 packet, directly onto the last
router. This is what all the real geeks call _Penultimate Hop Popping_ or PHP, none of that website
programming language rubbish!
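The penultimate-hop decision can be sketched as follows — a hypothetical helper, not VPP code. Label 3 never appears on the wire: it only changes what the LSR does.

```python
# Sketch of the PHP decision described above. The label stack is a list
# with index 0 as the top (outermost) entry. An out-label of 3 (Implicit
# NULL) means "pop instead of swap", so the value 3 itself is never sent.

IMPLICIT_NULL = 3

def penultimate_rewrite(stack: list[int], out_label: int) -> list[int]:
    """Apply the configured out-label to the top of the label stack."""
    if out_label == IMPLICIT_NULL:
        return stack[1:]                 # PHP: pop, forward bare payload
    return [out_label] + stack[1:]       # ordinary swap (0 = IPv4 Explicit NULL)
```

With a single-label stack, `out-labels 3` leaves no MPLS header at all, which is exactly the IPv4-in-VLAN-20 packet seen in the tcpdump below.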
Let me replace the FIB entries in the penultimate routers with this magic label value (3):
```
vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 3
vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 3
```
Now I would expect this _penultimate hop popping_ to yield an IPv4 packet between `vpp1-1` and
`vpp1-0` for the ICMP echo-request, as well as an IPv4 packet between `vpp1-2` and `vpp1-3` for the ICMP
echo-reply on the way back, and would you look at that:
```
17:45:35.783214 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.783879 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.784222 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.785123 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0,
ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.785311 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.785533 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.786465 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.787354 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.787575 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 102: vlan 22, p 0,
ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.788320 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
```
I can now see that the behavior has changed in a subtle way once again. Where before, there were
**three** MPLS packets all the way between `vpp1-3` through `vpp1-2` and `vpp1-1` onto `vpp1-0`, now
there are only **two** MPLS packets, and the last one (on the way out in VLAN 20, and on the way
back in VLAN 22), is just an IPv4 packet. PHP is slick!
## Loadtesting Setup: Bare Metal
{{< image width="350px" float="right" src="/assets/vpp-mpls/MPLS Lab.svg" alt="Bare Metal Setup" >}}
In 1997, an Internet Engineering Task Force (IETF) working group created standards to help fix the
issues of the time, mostly around internet traffic routing. MPLS was developed as an alternative to
multilayer switching and IP over asynchronous transfer mode (ATM). In the 90s, routers were
comparatively weak in terms of CPU, and things like _content addressable memory_ to facilitate
faster lookups were incredibly expensive. Back then, every FIB lookup counted, so tricks like
_Penultimate Hop Popping_ really helped. But what about now? I'm reasonably confident that any
silicon based router would not mind doing one extra MPLS FIB operation, and equally would not mind
unwrapping the MPLS packet at the end. But, since these things exist, I thought it would be a fun
activity to see how much they would help in the VPP world, where just like in the old days, every
operation performed on a packet does cost valuable CPU cycles.
I can't really perform a loadtest on the virtual machines backed by Open vSwitch, while tightly
packing six machines on one hypervisor. That setup is made specifically to do functional testing and
development work. To do a proper loadtest, I will need bare metal. So, I grabbed three Supermicro
SYS-5018D-FN8T, which I'm running throughout [[AS8298]({% post_url 2021-02-27-network %})], as I
know their performance quite well, and daisychained them with TenGig ports.
This way, I can take a look at the cost of _P-Routers_ (which only SWAP MPLS labels and forward the
result), as well as _PE-Routers_ (which have to encapsulate, and sometimes decapsulate the IP or
Ethernet traffic).
These machines get a fresh Debian Bookworm install and VPP 23.06 without any plugins. It's weird for
me to run a VPP instance without Linux CP, but in this case I'm going completely vanilla, so I
disable all plugins and give each VPP machine one worker thread. The install follows my popular
[[VPP-7]({% post_url 2021-09-21-vpp-7 %})] article. I decide to call the bare metal machines _France_,
_Belgium_ and _Netherlands_. And because if it ain't dutch, it ain't much, the _Netherlands_ machine
sits on top :)
### IPv4 forwarding performance
The way Cisco T-Rex works in its simplest stateless loadtesting mode, is that it reads a Scapy file,
for example `bench.py`, and it then generates a stream of traffic from its first port, through the
_device under test (DUT)_, and expects to see that traffic returned on its second port. In a
bidirectional mode, traffic is sent from `16.0.0.0/8` to `48.0.0.0/8` in one direction, and back
from `48.0.0.0/8` to `16.0.0.0/8` in the other.
OK so first things first, let me configure a basic skeleton, taking _Netherlands_ as an example:
```
netherlands# set interface ip address TenGigabitEthernet6/0/1 192.168.13.7/31
netherlands# set interface ip address TenGigabitEthernet6/0/1 2001:678:d78:230::2:2/112
netherlands# set interface state TenGigabitEthernet6/0/1 up
netherlands# ip route add 100.64.0.0/30 via 192.168.13.6
netherlands# ip route add 192.168.13.4/31 via 192.168.13.6
netherlands# set interface ip address TenGigabitEthernet6/0/0 100.64.1.2/30
netherlands# set interface state TenGigabitEthernet6/0/0 up
netherlands# ip nei TenGigabitEthernet6/0/0 100.64.1.1 9c:69:b4:61:ff:40 static
netherlands# ip route add 16.0.0.0/8 via 100.64.1.1
netherlands# ip route add 48.0.0.0/8 via 192.168.13.6
```
The _Belgium_ router just has static routes back and forth, and the _France_ router looks similar
except it has its static routes all pointing in the other direction, and of course it has different
/31 transit networks towards T-Rex and _Belgium_. The one thing that is a bit curious is the use of
a static ARP entry that allows the VPP routers to resolve the nexthop for T-Rex -- in the case
above, T-Rex is sourcing from 100.64.1.1/30 (which has MAC address 9c:69:b4:61:ff:40) and sending to
our 100.64.1.2 on Te6/0/0.
{{< image src="/assets/vpp-mpls/trex.png" alt="T-Rex Baseline" >}}
After fiddling around a little bit with `imix`, I notice the machine is still keeping up
with one CPU thread in both directions (~6.5Mpps). So I switch to 64b packets and ramp up traffic
until that one VPP worker thread is saturated, which is at around the 9.2Mpps mark, so I lower it
slightly to a cool 9Mpps. Note: this CPU can run 3 worker threads in production, so each router
should be able to do roughly 27Mpps, which is way cool!
The machines are at this point all doing exactly the same: receive ethernet from DPDK, do an IPv4
lookup, rewrite the header, and emit the frame on another interface. I can see that clearly in the
runtime statistics, taking a look at _Belgium_ for example:
```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 7912.6, 10 sec internal node vector rate 207.47 loops/sec 20604.47
vector rates in 8.9997e6, out 9.0054e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/0-output active 172120948 35740749991 0 6.47e0 207.65
TenGigabitEthernet6/0/0-tx active 171687877 35650752635 0 8.49e1 207.65
TenGigabitEthernet6/0/1-output active 172119849 35740963315 0 7.79e0 207.65
TenGigabitEthernet6/0/1-tx active 171471125 35605967085 0 8.48e1 207.65
dpdk-input polling 171588827 71211720238 0 4.87e1 415.01
ethernet-input active 344675998 71571710136 0 2.16e1 207.65
ip4-input-no-checksum active 343340278 71751697912 0 1.86e1 208.98
ip4-load-balance active 342929714 71661706997 0 1.44e1 208.97
ip4-lookup active 341632798 71391716172 0 2.28e1 208.97
ip4-rewrite active 342498637 71571712383 0 2.59e1 208.97
```
Looking at the time spent for one individual packet, it's about 245 CPU cycles, and considering the
cores on this Xeon D1518 run at 2.2GHz, that checks out very accurately: 2.2e9 / 245 = 9Mpps! Every
time that DPDK is asked for some work, it yields on average a vector of 208 packets -- and this is
why VPP is so super fast: the first packet may need to page in the instructions belonging to one of
the graph nodes, but the second through 208th packet will find almost 100% hitrate in the CPU's
instruction cache. Who needs RAM anyway?
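The arithmetic above is worth writing down once, because I'll be repeating it for every mode of
operation in this article. A tiny sketch (assuming a fixed 2.2GHz worker clock, no turbo):

```python
# Back-of-the-envelope check of the baseline: one Xeon D-1518 worker at
# 2.2 GHz spending ~245 cycles per packet should forward ~9 Mpps.
CLOCK_HZ = 2.2e9          # worker core clock (assumption: fixed, no turbo)
cycles_per_packet = 245   # summed from the 'show run' node clocks above

pps = CLOCK_HZ / cycles_per_packet
print(f"{pps / 1e6:.2f} Mpps")  # ≈ 8.98 Mpps, matching the ~9Mpps observed
```

The same division (clock divided by total cycles per packet) predicts the forwarding rate of every
loadtest below, which makes it easy to spot which router is the bottleneck.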
### MPLS forwarding performance
Now that I have a baseline, I can take a look at the difference between the IPv4 path and the MPLS
path, and here's where the routers will start to behave differently. _France_ and _Netherlands_ will
be _PE-Routers_ and handle encapsulation/decapsulation, while _Belgium_ has a comparatively easy
job, as it will only handle MPLS forwarding. I'll choose country codes for the labels: traffic
destined to _France_ will carry MPLS label 33,S=1, while traffic destined to _Netherlands_ will
carry MPLS label 31,S=1.
```
netherlands# ip ro del 48.0.0.0/8 via 192.168.13.6
netherlands# ip ro add 48.0.0.0/8 via 192.168.13.6 TenGigabitEthernet6/0/1 out-labels 33
netherlands# mpls local-label add 31 eos via ip4-lookup-in-table 0
belgium# ip route del 48.0.0.0/8 via 192.168.13.4
belgium# ip route del 16.0.0.0/8 via 192.168.13.7
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
belgium# mpls local-label add 31 eos via 192.168.13.7 TenGigabitEthernet6/0/0 out-labels 31
france# ip route del 16.0.0.0/8 via 192.168.13.5
france# ip route add 16.0.0.0/8 via 192.168.13.5 TenGigabitEthernet6/0/1 out-labels 31
france# mpls local-label add 33 eos via ip4-lookup-in-table 0
```
The MPLS _operations_ are no longer symmetric. On the way in, the _PE-Router_ has to
encapsulate the IPv4 packet into an MPLS packet, and on the way out, the _PE-Router_ has to
decapsulate the MPLS packet to reveal the IPv4 packet. So, I change the loadtester to be
unidirectional, and ask it to send 10Mpps from _Netherlands_ to _France_. As soon as I reconfigure
the routers in this mode, I see quite a bit of _packetlo_, as only 7.3Mpps make it through.
Interesting! I wonder where this traffic is dropped, and what the bottleneck is, precisely.
#### MPLS: PE Ingress Performance
First, let's take a look at _Netherlands_, to try to understand why it is more expensive:
```
netherlands# show run
Time 255.5, 10 sec internal node vector rate 256.00 loops/sec 29399.92
vector rates in 7.6937e6, out 7.6937e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/1-output active 7978541 2042505472 0 7.28e0 255.99
TenGigabitEthernet6/0/1-tx active 7678013 1965570304 0 8.25e1 255.99
dpdk-input polling 7684444 1965570304 0 4.55e1 255.79
ethernet-input active 7978549 2042507520 0 1.94e1 255.99
ip4-input-no-checksum active 7978557 2042509568 0 1.75e1 255.99
ip4-lookup active 7678013 1965570304 0 2.17e1 255.99
ip4-mpls-label-imposition-pipe active 7678013 1965570304 0 2.42e1 255.99
mpls-output active 7678013 1965570304 0 6.71e1 255.99
```
Each packet goes from `dpdk-input` into `ethernet-input`; the resulting IPv4 packet visits
`ip4-lookup`, where the MPLS out-label is found in the IPv4 FIB, the packet is then wrapped into
an MPLS packet in `ip4-mpls-label-imposition-pipe` and then sent through `mpls-output` to the NIC.
In total the input path (`ip4-*` plus `mpls-*`) takes **131 CPU cycles** for each packet. Including
all the nodes, from DPDK input to DPDK output sums up to 285 cycles, so 2.2GHz/285 = 7.69Mpps which
checks out.
#### MPLS: P Transit Performance
I would expect that _Belgium_ has it easier, as it's only doing label swapping and MPLS forwarding.
```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 595.6, 10 sec internal node vector rate 47.68 loops/sec 224464.40
vector rates in 7.6930e6, out 7.6930e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/1-output active 97711093 4659109793 0 8.83e0 47.68
TenGigabitEthernet6/0/1-tx active 96096377 4582172229 0 8.14e1 47.68
dpdk-input polling 161102959 4582172278 0 5.72e1 28.44
ethernet-input active 97710991 4659111684 0 2.45e1 47.68
mpls-input active 97709468 4659096718 0 2.25e1 47.68
mpls-label-imposition-pipe active 99324916 4736048227 0 2.52e1 47.68
mpls-lookup active 99324903 4736045943 0 3.25e1 47.68
mpls-output active 97710989 4659111742 0 3.04e1 47.68
```
Indeed, _Belgium_ can still breathe, it's spending **110 Cycles per packet** doing the MPLS
switching (`mpls-*`), which is 18% less than the _PE-Router_ ingress. Judging by the _vectors/Call_
(last column), it's also running a bit cooler than the ingress router.
It's nice to see that the claim that _P-Routers_ are cheaper on the CPU can be verified to be true
in practice!
#### MPLS: PE Egress Performance
On to the last router, _France_, which is in charge of decapsulating the MPLS packet and doing the
resulting IPv4 lookup:
```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 1067.2, 10 sec internal node vector rate 256.00 loops/sec 27986.96
vector rates in 7.3234e6, out 7.3234e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/0-output active 30528978 7815395072 0 6.59e0 255.99
TenGigabitEthernet6/0/0-tx active 30528978 7815395072 0 8.20e1 255.99
dpdk-input polling 30534880 7815395072 0 4.68e1 255.95
ethernet-input active 30528978 7815395072 0 1.97e1 255.99
ip4-load-balance active 30528978 7815395072 0 1.35e1 255.99
ip4-mpls-label-disposition-pip active 30528978 7815395072 0 2.82e1 255.99
ip4-rewrite active 30528978 7815395072 0 2.48e1 255.99
lookup-ip4-dst active 30815069 7888634368 0 3.09e1 255.99
mpls-input active 30528978 7815395072 0 1.86e1 255.99
mpls-lookup active 30528978 7815395072 0 2.85e1 255.99
```
This router is spending roughly **144.5 Cycles per packet** (in the `ip4-*` and `mpls-*` nodes) and
reveals itself as the bottleneck. _Netherlands_ sent _Belgium_ 7.69Mpps, which it all
forwarded to _France_, where only 7.3Mpps make it through this _PE-Router_ egress, and into the
hands of T-Rex. In total, this router is spending 298 cycles/packet, which amounts to 7.37Mpps.
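The chain's throughput is set by whichever router spends the most total cycles per packet. A quick
sketch with the two saturated routers (I'm leaving _Belgium_ out, since it was not saturated and
its total was not the limiter):

```python
# Throughput of a daisychain is min over routers of clock / total cycles
# per packet. Totals are the end-to-end node sums measured above.
CLOCK_HZ = 2.2e9
total_cycles = {"netherlands": 285, "france": 298}  # all nodes, per packet

rates = {router: CLOCK_HZ / cycles for router, cycles in total_cycles.items()}
bottleneck = min(rates, key=rates.get)
print(bottleneck, f"{rates[bottleneck] / 1e6:.2f} Mpps")  # france, ~7.38 Mpps
```

That ~7.38Mpps prediction lines up neatly with the ~7.3Mpps that T-Rex actually received.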
### MPLS Explicit Null performance
At the beginning of this article, I made a claim that we could take some shortcuts, and now is a
good time to see if those short cuts are worthwhile in the VPP setting. I'll reconfigure the
_Belgium_ router to set the _IPv4 Explicit NULL_ label (0), which can help my poor overloaded
_France_ router save some valuable CPU cycles.
```
belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0
```
The situation for _Belgium_ doesn't change at all, it's still doing the SWAP operation on the
incoming packet, but it's writing label 0,S=1 now (instead of label 33,S=1 before). But, haha!, take
a look at _France_ for an important difference:
```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 53.3, 10 sec internal node vector rate 85.35 loops/sec 77643.80
vector rates in 7.6933e6, out 7.6933e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/0-output active 4773870 409847372 0 6.96e0 85.85
TenGigabitEthernet6/0/0-tx active 4773870 409847372 0 8.07e1 85.85
dpdk-input polling 4865704 409847372 0 5.01e1 84.23
ethernet-input active 4773870 409847372 0 2.15e1 85.85
ip4-load-balance active 4773869 409847235 0 1.51e1 85.85
ip4-rewrite active 4773870 409847372 0 2.60e1 85.85
lookup-ip4-dst-itf active 4773870 409847372 0 3.41e1 85.85
mpls-input active 4773870 409847372 0 1.99e1 85.85
mpls-lookup active 4773870 409847372 0 3.01e1 85.85
```
First off, I notice the _input_ vector rates match the _output_ vector rates, both at 7.69Mpps, and
that the average _Vectors/Call_ is no longer pegged at 256. The router is now spending **125 Cycles
per packet** which is a lot better than it was before (15.4% better than 144.5 Cycles/packet).
Conclusion: **MPLS Explicit NULL is cheaper**!
### MPLS Implicit Null (PHP) performance
So there's one mode of operation left for me to play with. What if we asked _Belgium_ to unwrap the
MPLS packet and forward it as an IPv4 packet towards _France_, in other words apply _Penultimate Hop
Popping_? Of course, the ingress _Netherlands_ won't change at all, but I reconfigure the _Belgium_
router, like so:
```
belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 3
```
The situation in _Belgium_ now looks subtly different:
```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 171.1, 10 sec internal node vector rate 50.64 loops/sec 188552.87
vector rates in 7.6966e6, out 7.6966e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/1-output active 26128425 1316828499 0 8.74e0 50.39
TenGigabitEthernet6/0/1-tx active 26128424 1316828327 0 8.16e1 50.39
dpdk-input polling 39339977 1316828499 0 5.58e1 33.47
ethernet-input active 26128425 1316828499 0 2.39e1 50.39
ip4-mpls-label-disposition-pip active 26128425 1316828499 0 3.07e1 50.39
ip4-rewrite active 27648864 1393790359 0 2.82e1 50.41
mpls-input active 26128425 1316828499 0 2.21e1 50.39
mpls-lookup active 26128422 1316828355 0 3.16e1 50.39
```
After doing the `mpls-lookup`, this router finds that it can just toss the label and forward the
packet as IPv4 down south. Cost for _Belgium_: **113 Cycles per packet**.
_France_ is now not participating in MPLS at all - it is simply receiving IPv4 packets which it has
to route back towards T-Rex. I take one final look at _France_ to see where it's spending its time:
```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 397.3, 10 sec internal node vector rate 42.17 loops/sec 259634.88
vector rates in 7.7112e6, out 7.6964e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet6/0/0-output active 74381543 3211443520 0 9.47e0 43.18
TenGigabitEthernet6/0/0-tx active 70820630 3057504872 0 8.26e1 43.17
dpdk-input polling 131873061 3063377312 0 6.09e1 23.23
ethernet-input active 72645873 3134461107 0 2.66e1 43.15
ip4-input-no-checksum active 70820629 3057504812 0 2.68e1 43.17
ip4-load-balance active 72646140 3134473660 0 1.74e1 43.15
ip4-lookup active 70820628 3057504796 0 2.79e1 43.17
ip4-rewrite active 70820631 3057504924 0 2.96e1 43.17
```
As an IPv4 router, _France_ spends in total **102 Cycles per packet**. This matches very closely
with the 104 cycles/packet I found when doing my baseline loadtest with only IPv4 routing. I love it
when numbers align!!
### Scaling
One thing that I was curious to know, is if MPLS packets would allow for multiple receive queues, to
enable horizontal scaling by adding more VPP worker threads. The answer is a resounding YES! If I
restart the VPP routers _Netherlands_, _Belgium_ and _France_ with three workers and set DPDK
`num-rx-queues` to 3 as well, I see perfect linear scaling, in other words these little routers
would be able to forward roughly 27Mpps of MPLS packets with varying inner payloads (be it IPv4 or
IPv6 or Ethernet traffic with differing src/dest MAC addresses). All things said, IPv4 is still a
little bit cheaper on the CPU, at least on these routers with only a very small routing table. But,
it's great to see that MPLS forwarding can leverage RSS.
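The scaling claim is simple enough to state as arithmetic (assuming the linear scaling I observed
holds, and ignoring any cross-thread contention):

```python
# Linear RSS scaling sketch: with DPDK num-rx-queues matched to the worker
# thread count, aggregate throughput is per-worker throughput times workers.
per_worker_mpps = 9   # single-thread 64b forwarding rate measured earlier
workers = 3           # this CPU can run 3 worker threads in production

print(per_worker_mpps * workers, "Mpps")  # 27 Mpps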
### Conclusions
This is all fine and dandy, but I think it's a bit trickier to see if PHP is actually cheaper or not.
To answer this question, I think I should count the total amount of CPU cycles spent end to end: for a
packet traveling from T-Rex coming into _Netherlands_, through _Belgium_ and _France_, and back out
to T-Rex.
| | Netherlands | Belgium | France | Total Cost |
| ------------------------- | ----------- | ----------- | ----------- | ------------ |
| Regular IPv4 path | 104 cycles | 104 cycles | 104 cycles | 312 cycles |
| MPLS: Simple LSP | 131 cycles | 110 cycles | 145 cycles | 386 cycles |
| MPLS: Explicit NULL LSP | 131 cycles | 110 cycles | 125 cycles | 366 cycles |
| MPLS: Penultimate Hop Pop | 131 cycles | 113 cycles | 102 cycles | ***346 cycles*** |
***Note***: The clock cycle numbers here are only the `*mpls*` and `*ip4*` nodes, excluding the `*-input`,
`*-output` and `*-tx` nodes, as they will add the same cost for all modes of operation.
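To make sure I didn't fumble the totals column while my head was spinning, here's the table reduced
to a few lines of arithmetic:

```python
# Per-router cycles/packet for each mode, taken from the table above,
# in the order (Netherlands, Belgium, France).
modes = {
    "ipv4":         (104, 104, 104),
    "mpls_simple":  (131, 110, 145),
    "mpls_expnull": (131, 110, 125),
    "mpls_php":     (131, 113, 102),
}
totals = {name: sum(cycles) for name, cycles in modes.items()}
print(totals)  # MPLS variants rank: PHP < Explicit NULL < Simple LSP
```

Plain IPv4 still wins overall, but among the MPLS variants, Penultimate Hop Popping comes out
cheapest end to end.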
I threw a lot of numbers into this article, and my head is spinning as I write this. But I still
think I can wrap it up in a way that allows me to have a few high level takeaways:
* IPv4 forwarding is a fair bit cheaper than MPLS forwarding (with an empty FIB, anyway). I had
not expected this!
* End to end, the MPLS bottleneck is in the _PE-Ingress_ operation.
* Explicit NULL helps without any drawbacks, as it cuts off one MPLS FIB lookup in the _PE-Egress_
operation.
* Implicit NULL (aka Penultimate Hop Popping) is also the fastest way to do MPLS with VPP, all
things considered.
## What's next
I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling
to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)
Our work is mostly complete; there are two pending Gerrits which should be ready to review and
certainly ready to play with:
1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
_LIP_ interfaces and Linux sysctl for MPLS input.
1. [[Gerrit 38702](https://gerrit.fd.io/r/c/vpp/+/38702)]: This adds the ability to listen to Netlink
messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
and MPLS FIB in the VPP dataplane.
If you'd like to test this - reach out to the VPP Developer mailinglist
[[ref](mailto:vpp-dev@lists.fd.io)] any time!

---
date: "2023-05-21T11:01:14Z"
title: VPP MPLS - Part 3
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
**Special Thanks**: Adrian _vifino_ Pistol for writing this code and for the wonderful collaboration!
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
In the [[first article]({%post_url 2023-05-07-vpp-mpls-1 %})] of this series, I took a look at MPLS
in general, and how setting up static _Label Switched Paths_ can be done in VPP. A few details on
special case labels (such as _Implicit Null_ which enabled the fabled _Penultimate Hop Popping_)
were missing, so I took a good look at them in the [[second article]({% post_url
2023-05-17-vpp-mpls-2 %})] of the series.
This was all just good fun but also allowed me to buy some time for
[@vifino](https://chaos.social/@vifino) who has been implementing MPLS handling within the Linux
Control Plane plugin for VPP! This final article in the series shows the engineering considerations
that went into writing the plugin, which is currently under review but reasonably complete.
Considering the VPP 23.06 cutoff is next week, I'm not super hopeful that we'll be able to get a
full community / committer review in time, but at this point both @vifino and I think this code is
ready for consumption - considering FRR has a good _Label Distribution Protocol_ daemon, I'll switch
out of my usual habitat of Bird and install a LAB with FRR.
Caveat emptor: outside of a modest functional and load-test, this MPLS functionality
hasn't seen a lot of mileage as it's only a few weeks old at this point, so it could definitely
contain some rough edges. Use at your own risk, but if you did want to discuss issues, the
[[vpp-dev@](mailto:vpp-dev@lists.fd.io)] mailinglist is a good first stop.
## Introduction
MPLS support is fairly complete in VPP already, but programming the dataplane would require custom
integrations, while using the Linux netlink subsystem feels easier from an end-user point of view.
This is a technical deep dive into the implementation of MPLS in the Linux Control Plane plugin for
VPP. If you haven't already, now is a good time to read up on the initial implementation of LCP:
* [[Part 1]({% post_url 2021-08-12-vpp-1 %})]: Punting traffic through TUN/TAP interfaces into Linux
* [[Part 2]({% post_url 2021-08-13-vpp-2 %})]: Mirroring VPP interface configuration into Linux
* [[Part 3]({% post_url 2021-08-15-vpp-3 %})]: Automatically creating sub-interfaces in Linux
* [[Part 4]({% post_url 2021-08-25-vpp-4 %})]: Synchronize link state, MTU and addresses to Linux
* [[Part 5]({% post_url 2021-09-02-vpp-5 %})]: Netlink Listener, synchronizing state from Linux to VPP
* [[Part 6]({% post_url 2021-09-10-vpp-6 %})]: Observability with LibreNMS and VPP SNMP Agent
* [[Part 7]({% post_url 2021-09-21-vpp-7 %})]: Productionizing and reference Supermicro fleet at IPng
To keep this writeup focused, I'll assume the anatomy of VPP plugins and the Linux Controlplane
_Interface_ and _Netlink_ plugins are understood. That way, I can focus on the _changes_ needed for
MPLS integration, which at first glance seem reasonably straight forward.
## VPP Linux-CP: Interfaces
First off, to enable any MPLS forwarding at all in VPP, I have to create the MPLS forwarding table
and enable MPLS on one or more interfaces:
```
vpp# mpls table add 0
vpp# lcp create GigabitEthernet10/0/0 host-if e0
vpp# set int mpls GigabitEthernet10/0/0 enable
```
What happens when the Gi10/0/0 interface has a _Linux Interface Pair (LIP)_ is that there exists a
corresponding TAP interface in the dataplane (typically called `tapX`) which in turn appears on the
Linux side as `e0`. Linux will want to be able to send MPLS datagrams into `e0`, and for that, two
things must happen:
1. Linux kernel must enable MPLS input on `e0`, typically with a sysctl.
1. VPP must enable MPLS on the TAP, in addition to the phy Gi10/0/0.
Therefore, the first order of business is to create a hook where the Linux CP interface plugin can
be made aware if MPLS is enabled or disabled in VPP - it turns out, such a callback function
_definition_ already exists, but it was never implemented. [[Gerrit
38826](https://gerrit.fd.io/r/c/vpp/+/38826)] adds a function `mpls_interface_state_change_add_callback()`,
which implements the ability to register a callback on MPLS on/off in VPP.
Now that the callback plumbing exists, Linux CP will want to register one of these, so that it can
set MPLS to the same enabled or disabled state on the Linux interface using
`/proc/sys/net/mpls/conf/${host-if}/input` (which is the moral equivalent of running `sysctl`), and
it'll also call `mpls_sw_interface_enable_disable()` on the TAP interface. With these changes both
implemented, enabling MPLS now looks like this in the logs:
```
linux-cp/mpls-sync: sync_state_cb: called for sw_if_index 1
linux-cp/mpls-sync: sync_state_cb: mpls enabled 1 parent itf-pair: [1] GigabitEthernet10/0/0 tap2 e0 97 type tap netns dataplane
linux-cp/mpls-sync: sync_state_cb: called for sw_if_index 8
linux-cp/mpls-sync: sync_state_cb: set mpls input for e0
```
Take a look at the code that implements enable/disable semantics in `src/plugins/linux-cp/lcp_mpls_sync.c`.
## VPP Linux-CP: Netlink
When Linux installs a route with MPLS labels, it will be seen in the return value of
`rtnl_route_nh_get_encap_mpls_dst()`. One or more labels can now be read using
`nl_addr_get_binary_addr()` yielding `struct mpls_label`, which contains the label value, experiment
bits and TTL, and these can be added to the route path in VPP by casting them to `struct
fib_mpls_label_t`. The last label in the stack will have the S-bit set, so we can continue consuming these
until we find that condition. The first patchset that plays around with these semantics is
[[38702#2](https://gerrit.fd.io/r/c/vpp/+/38702/2)]. As you can see, MPLS is going to look very much
like IPv4 and IPv6 route updates in [[previous work]({%post_url 2021-09-02-vpp-5 %})], in that they
take the Netlink representation, rewrite them into VPP representation, and update the FIB.
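To make the label-stack loop concrete, here's an illustrative decode in Python of what the plugin
does in C over the `struct mpls_label` entries. The field layout (label:20, exp/TC:3, S:1, TTL:8 in
a big-endian 32-bit word) is the standard MPLS label stack encoding; the function name and sample
stack are my own for illustration:

```python
import struct

def parse_label_stack(data: bytes):
    """Consume 4-byte label stack entries until the S (bottom of stack) bit."""
    labels, offset = [], 0
    while offset < len(data):
        (entry,) = struct.unpack_from("!I", data, offset)  # network byte order
        offset += 4
        label = entry >> 12          # bits 12..31: label value
        exp = (entry >> 9) & 0x7     # bits 9..11: experimental / traffic class
        s = (entry >> 8) & 0x1       # bit 8: bottom of stack
        ttl = entry & 0xFF           # bits 0..7: time to live
        labels.append((label, exp, s, ttl))
        if s:                        # S=1 means this was the last label
            break
    return labels

# Two-label stack: outer label 33 (S=0), inner label 31 (S=1), TTL 64 on both.
stack = struct.pack("!II", (33 << 12) | 64, (31 << 12) | (1 << 8) | 64)
print(parse_label_stack(stack))  # [(33, 0, 0, 64), (31, 0, 1, 64)]
```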
Up until now, the Linux Controlplane netlink plugin understands only IPv4 and IPv6. So some
preparation work is called for:
* ***lcp_router_proto_k2f()*** gains the ability to cast Linux `AF_*` into VPP's `FIB_PROTOCOL_*`.
* ***lcp_router_route_mk_prefix()*** turns into a switch statement that creates a `fib_prefix_t`
for MPLS address family, in addition to the existing IPv4 and IPv6 types. It uses the non-EOS
type.
* ***lcp_router_mpls_nladdr_to_path()*** implements the loop that I described above, taking the
stack of `struct mpls_label` from Netlink and turning them into a vector of `fib_mpls_label_t`
for the VPP FIB.
* ***lcp_router_route_path_parse()*** becomes aware of MPLS SWAP and POP operations (the latter
being the case if there are 0 labels in the Netlink label stack)
* ***lcp_router_fib_route_path_dup()*** is a helper function to make a copy of the FIB path
for the EOS and non-EOS VPP FIB inserts.
The VPP FIB differentiates between entries that are non-EOS (S=0), and can treat them differently to
those which are EOS (end of stack, S=1). Linux does not make this destinction, so it's safest to
just install non-EOS **and** EOS entries for each route from Linux. This is why
`lcp_router_fib_route_path_dup()` exists, otherwise Netlink route deletions for the MPLS routes
will yield a double free later on.
This prep work then allows for the following two main functions to become MPLS aware:
* ***lcp_router_route_add()*** when Linux sends a Netlink message about a new route, and that
route carries MPLS labels, make a copy of the path for the EOS entry and proceed to insert
both the non-EOS and newly created EOS entries into the FIB,
* ***lcp_router_route_del()*** when Linux sends a Netlink message about a deleted route, we can
remove both the EOS and non-EOS variants of the route from VPP's FIB.
## VPP Linux-CP: MPLS with FRR
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
I finally get to show off @vifino's lab! It's installed based off of a Debian Bookworm build, because
there's a few Netlink Library changes that haven't made their way into Debian Bullseye yet. The LAB
image is quickly built and distributed, and for this LAB I'm choosing specifically for
[[FRR](https://frrouting.org/)] because it ships with a _Label Distribution Protocol_ daemon out of
the box.
First order of business is to enable MPLS on the correct interfaces, and create the MPLS FIB table.
On each machine, I insert the following in the startup sequence:
```
ipng@vpp0-1:~$ cat << EOF | tee -a /etc/vpp/config/manual.vpp
mpls table add 0
set interface mpls GigabitEthernet10/0/0 enable
set interface mpls GigabitEthernet10/0/1 enable
EOF
```
The lab comes with OSPF and OSPFv3 enabled on each of the Gi10/0/0 and Gi10/0/1 interfaces that go
from East to West. This extra sequence enables MPLS on those interfaces, and because they have a
_Linux Interface Pair (LIP)_, VPP will enable MPLS on the internal TAP interfaces, as well as set
the Linux `sysctl` to allow the kernel to send MPLS encapsulated packets towards VPP.
Next up, turning on _LDP_ for FRR, which is easy enough:
```
ipng@vpp0-1:~$ vtysh
vpp0-2# conf t
vpp0-2(config)# mpls ldp
router-id 192.168.10.1
dual-stack cisco-interop
ordered-control
!
address-family ipv4
discovery transport-address 192.168.10.1
label local advertise explicit-null
interface e0
interface e1
exit-address-family
!
address-family ipv6
discovery transport-address 2001:678:d78:200::1
label local advertise explicit-null
ttl-security disable
interface e0
interface e1
exit-address-family
exit
```
I configure _LDP_ here to prefer advertising locally connected routes as _MPLS Explicit NULL_, which I
described in detail in the [[previous post]({% post_url 2023-05-17-vpp-mpls-2 %})]. It tells the
penultimate router to send this router a packet as MPLS with label value 0,S=1 for IPv4 and value 2,S=1
for IPv6, so that VPP knows immediately to decapsulate the packet and continue with IPv4/IPv6 forwarding.
An alternative here is setting implicit-null, which instructs the router before this one to perform
_Penultimate Hop Popping_. If this is confusing, take a look at that article for reference!
Otherwise, I just give each router a transport-address on a loopback interface and a unique router-id
(the same as used in OSPF and OSPFv3), and we're off to the races. Just take a look at how easy this was:
```
vpp0-1# show mpls ldp discovery
AF ID Type Source Holdtime
ipv4 192.168.10.0 Link e0 15
ipv4 192.168.10.2 Link e1 15
ipv6 192.168.10.0 Link e0 15
ipv6 192.168.10.2 Link e1 15
vpp0-1# show mpls ldp neighbor
AF ID State Remote Address Uptime
ipv6 192.168.10.0 OPERATIONAL 2001:678:d78:200::
19:49:10
ipv6 192.168.10.2 OPERATIONAL 2001:678:d78:200::2
19:49:10
```
The first `show ... discovery` shows which interfaces are receiving multicast _LDP Hello Packets_,
and because I enabled discovery for both IPv4 and IPv6, I can see two pairs there. If I look at
which interfaces formed adjacencies, `show ... neighbor` reveals that LDP is preferring IPv6, and
that both adjacencies to `vpp0-0` and `vpp0-2` are operational. Awesome sauce!
I see _LDP_ neighbor adjacencies, so let me show you what label information was actually
exchanged, in three different places, **FRR**'s label distribution protocol daemon, **Linux**'s
IPv4, IPv6 and MPLS routing tables, and **VPP**'s dataplane forwarding information base.
### MPLS: FRR view
There are two things to note -- the IPv4 and IPv6 routing table, called a _Forwarding Equivalent Class
(FEC)_, and the MPLS forwarding table, called the _MPLS FIB_:
```
vpp0-1# show mpls ldp binding
AF Destination Nexthop Local Label Remote Label In Use
ipv4 192.168.10.0/32 192.168.10.0 20 exp-null yes
ipv4 192.168.10.1/32 0.0.0.0 exp-null - no
ipv4 192.168.10.2/32 192.168.10.2 16 exp-null yes
ipv4 192.168.10.3/32 192.168.10.2 33 33 yes
ipv4 192.168.10.4/31 192.168.10.0 21 exp-null yes
ipv4 192.168.10.6/31 192.168.10.0 exp-null exp-null no
ipv4 192.168.10.8/31 192.168.10.2 exp-null exp-null no
ipv4 192.168.10.10/31 192.168.10.2 17 exp-null yes
ipv6 2001:678:d78:200::/128 192.168.10.0 18 exp-null yes
ipv6 2001:678:d78:200::1/128 0.0.0.0 exp-null - no
ipv6 2001:678:d78:200::2/128 192.168.10.2 31 exp-null yes
ipv6 2001:678:d78:200::3/128 192.168.10.2 38 34 yes
ipv6 2001:678:d78:210::/60 0.0.0.0 48 - no
ipv6 2001:678:d78:210::/128 0.0.0.0 39 - no
ipv6 2001:678:d78:210::1/128 0.0.0.0 40 - no
ipv6 2001:678:d78:210::2/128 0.0.0.0 41 - no
ipv6 2001:678:d78:210::3/128 0.0.0.0 42 - no
vpp0-1# show mpls table
Inbound Label Type Nexthop Outbound Label
------------------------------------------------------------------
16 LDP 192.168.10.9 IPv4 Explicit Null
17 LDP 192.168.10.9 IPv4 Explicit Null
18 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null
19 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null
20 LDP 192.168.10.6 IPv4 Explicit Null
21 LDP 192.168.10.6 IPv4 Explicit Null
31 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null
32 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null
33 LDP 192.168.10.9 33
38 LDP fe80::5054:ff:fe02:1000 34
```
In the first table, each entry of the IPv4 and IPv6 routing table, as fed by OSPF and OSPFv3,
gets a label associated with it. The _LDP_ negotiation asks our peer to set a specific label,
and informs the peer which label we intend to use for the _Label Switched Path_ towards that
destination. I'll give two examples to illustrate how this table is used:
1. This router (`vpp0-1`) has a peer `vpp0-0` and when this router wants to send traffic to
it, it'll be sent with `exp-null` (because it is the last router in the _LSP_), but when
other routers might want to use this router to reach `vpp0-0`, they should use the MPLS
label value 20.
1. This router (`vpp0-1`) is _not_ directly connected to `vpp0-3` and as such, its IPv4 and IPv6
loopback addresses are going to contain labels in both directions: if `vpp0-1` itself
wants to send a packet to `vpp0-3`, it will use label value 33 and 38 respectively.
However, if other routers want to use this router to reach `vpp0-3`, they should use the
MPLS label value 33 and 34 respectively.
The second table describes the MPLS _Forwarding Information Base (FIB)_. When receiving an
MPLS packet with an inbound label noted in this table, the operation applied is _SWAP_ to the
outbound label, and forward towards a nexthop -- this is the stuff that _P-Routers_ use when
transiting MPLS traffic.
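The _SWAP_ behavior of a _P-Router_ can be sketched as a simple lookup table. This is a toy model only (labels and nexthops copied from the `show mpls table` output above), not VPP's actual FIB implementation:

```python
# Toy model of an MPLS FIB on a P-Router: map inbound label to
# (outbound label, nexthop). Values taken from vpp0-1's table above.
IPV4_EXPLICIT_NULL = 0
IPV6_EXPLICIT_NULL = 2

mpls_fib = {
    33: (33, "192.168.10.9"),                  # transit LSP towards vpp0-3
    20: (IPV4_EXPLICIT_NULL, "192.168.10.6"),  # penultimate hop towards vpp0-0
    31: (IPV6_EXPLICIT_NULL, "fe80::5054:ff:fe02:1000"),
}

def swap(in_label: int):
    """Return (out_label, nexthop), or None if the label is unknown (drop)."""
    return mpls_fib.get(in_label)

print(swap(33))   # (33, '192.168.10.9')
print(swap(20))   # (0, '192.168.10.6') -- IPv4 Explicit NULL
print(swap(99))   # None -- no FIB entry, packet is dropped
```

The drop case at the end matters: it is exactly what happens later when a labeled packet hits a FIB that has no entry for it.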
### MPLS: Linux view
FRR's LDP daemon will offer both of these routing tables to the Linux kernel using Netlink
messages, so the Linux view looks similar:
```
root@vpp0-1:~# ip ro
192.168.10.0 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20
192.168.10.2 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20
192.168.10.3 nhid 227 encap mpls 33 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20
192.168.10.4/31 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20
192.168.10.6/31 dev e0 proto kernel scope link src 192.168.10.7
192.168.10.8/31 dev e1 proto kernel scope link src 192.168.10.8
192.168.10.10/31 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20
root@vpp0-1:~# ip -6 ro
2001:678:d78:200:: nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium
2001:678:d78:200::1 dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::2 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium
2001:678:d78:200::3 nhid 239 encap mpls 34 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium
2001:678:d78:201::/112 nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium
2001:678:d78:201::1:0/112 dev e0 proto kernel metric 256 pref medium
2001:678:d78:201::2:0/112 dev e1 proto kernel metric 256 pref medium
2001:678:d78:201::3:0/112 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium
root@vpp0-1:~# ip -f mpls ro
16 as to 0 via inet 192.168.10.9 dev e1 proto ldp
17 as to 0 via inet 192.168.10.9 dev e1 proto ldp
18 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp
19 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp
20 as to 0 via inet 192.168.10.6 dev e0 proto ldp
21 as to 0 via inet 192.168.10.6 dev e0 proto ldp
31 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp
32 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp
33 as to 33 via inet 192.168.10.9 dev e1 proto ldp
38 as to 34 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp
```
The first two tables show a 'regular' Linux routing table for IPv4 and IPv6 respectively, except there's
an `encap mpls <X>` added for all not-directly-connected prefixes. In this case, `vpp0-1` connects on
`e0` to `vpp0-0` to the West, and on interface `e1` to `vpp0-2` to the East. These connected routes do
not carry MPLS information and in fact, this is how LDP can continue to work and exchange information
naturally even when no _LSPs_ are established yet.
The third table is the _MPLS FIB_, and it shows the special case of _MPLS Explicit NULL_ clearly. All IPv4
routes for which this router is the penultimate hop carry the outbound label value 0,S=1, while the IPv6
routes carry the value 2,S=1. Booyah!
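For reference, an MPLS label stack entry is a single 32-bit word: a 20-bit label, 3 bits of EXP/TC, the _S_ (end-of-stack) bit, and an 8-bit TTL. Here's a quick sketch of how the two _Explicit NULL_ values pack on the wire; `mpls_lse()` is a hypothetical helper of mine, not something from VPP or Linux:

```python
import struct

def mpls_lse(label: int, exp: int = 0, s: int = 1, ttl: int = 64) -> bytes:
    """Pack one 32-bit MPLS label stack entry: label(20) | exp(3) | S(1) | ttl(8)."""
    word = (label << 12) | (exp << 9) | (s << 8) | ttl
    return struct.pack("!I", word)

# IPv4 Explicit NULL (0) and IPv6 Explicit NULL (2), both end-of-stack, ttl 63:
print(mpls_lse(0, s=1, ttl=63).hex())   # '0000013f'
print(mpls_lse(2, s=1, ttl=63).hex())   # '0000213f'
```

Those four bytes are all there is to an MPLS header -- notably, there is no protocol field saying what the payload is, which becomes relevant later.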
### MPLS: VPP view
The _FIB_ information in general is super densely populated in VPP. Rather than dumping the whole table,
I'll show one example, for `192.168.10.3`, which we can see above will be encapsulated into an MPLS
packet with label value 33,S=0 before being forwarded:
```
root@vpp0-1:~# vppctl show ip fib 192.168.10.3
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
192.168.10.3/32 fib:0 index:78 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[29] locks:6 flags:shared, uPRF-list:53 len:1 itfs:[2, ]
path:[41] pl-index:29 ip4 weight=1 pref=20 attached-nexthop: oper-flags:resolved,
192.168.10.9 GigabitEthernet10/0/1
[@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800
Extensions:
path:41 labels:[[33 pipe ttl:0 exp:0]]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:81 buckets:1 uRPF:53 to:[2421:363846]]
[0] [@13]: mpls-label[@4]:[33:64:0:eos]
[@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847
```
The trick is looking at the Extensions, which shows the _out-labels_ set to 33, with ttl=0 (which makes
VPP copy the TTL from the IPv4 packet itself), and exp=0. It can then forward the packet as MPLS onto
the nexthop at 192.168.10.9 (`vpp0-2.e0` on Gi10/0/1).
The MPLS _FIB_ is also a bit chatty, and shows a fundamental difference with Linux:
```
root@vpp0-1:~# vppctl show mpls fib 33
MPLS-VRF:0, fib_index:0 locks:[interface:6, CLI:1, lcp-rt:1, ]
33:neos/21 fib:0 index:37 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ]
path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
192.168.10.9 GigabitEthernet10/0/1
[@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800
Extensions:
path:81 labels:[[33 pipe ttl:0 exp:0]]
forwarding: mpls-neos-chain
[@0]: dpo-load-balance: [proto:mpls index:40 buckets:1 uRPF:21 to:[0:0]]
[0] [@6]: mpls-label[@28]:[33:64:0:neos]
[@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847
33:eos/21 fib:0 index:64 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ]
path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
192.168.10.9 GigabitEthernet10/0/1
[@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800
Extensions:
path:81 labels:[[33 pipe ttl:0 exp:0]]
forwarding: mpls-eos-chain
[@0]: dpo-load-balance: [proto:mpls index:67 buckets:1 uRPF:21 to:[73347:10747680]]
[0] [@6]: mpls-label[@29]:[33:64:0:eos]
[@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847
```
I note that there are two entries here -- I wrote about them above. The MPLS implementation in VPP
allows for a different forwarding behavior in the case that the label inspected is the last one in the
stack (S=1), which is the usual case called _End of Stack (EOS)_. But, it also has a second entry
which tells it what to do if S=0 or _Not End of Stack (NEOS)_. Linux doesn't make the distinction, so
@vifino added two identical entries using that ***lcp_router_fib_route_path_dup()*** function.
But, what the entries themselves mean is that if this `vpp0-1` router were to receive an MPLS
packet with label value 33,S=1 (or value 33,S=0), it'll perform a _SWAP_ operation and put as
new outbound label (the same) value 33, and forward the packet as MPLS onto 192.168.10.9 on Gi10/0/1.
## Results
And with that, I think we achieved a running LDP with IPv4 and IPv6 and forwarding + encapsulation
of MPLS with VPP. One cool wrapup I thought I'd leave you with, is showing how these MPLS routers
are transparent with respect to IP traffic going through them. If I look at the diagram above, `lab`
reaches `vpp0-3` via three hops: first into `vpp0-0` where it is wrapped into MPLS and forwarded
to `vpp0-1`, and then through `vpp0-2`, which sets the _Explicit NULL_ label and forwards again
as MPLS onto `vpp0-3`, which does the IPv4 and IPv6 lookup.
Check this out:
```
pim@lab:~$ for node in $(seq 0 3); do traceroute -4 -q1 vpp0-$node; done
traceroute to vpp0-0 (192.168.10.0), 30 hops max, 60 byte packets
1 vpp0-0.lab.ipng.ch (192.168.10.0) 1.907 ms
traceroute to vpp0-1 (192.168.10.1), 30 hops max, 60 byte packets
1 vpp0-1.lab.ipng.ch (192.168.10.1) 2.460 ms
traceroute to vpp0-2 (192.168.10.2), 30 hops max, 60 byte packets
1 vpp0-2.lab.ipng.ch (192.168.10.2) 3.860 ms
traceroute to vpp0-3 (192.168.10.3), 30 hops max, 60 byte packets
1 vpp0-3.lab.ipng.ch (192.168.10.3) 4.414 ms
pim@lab:~$ for node in $(seq 0 3); do traceroute -6 -q1 vpp0-$node; done
traceroute to vpp0-0 (2001:678:d78:200::), 30 hops max, 80 byte packets
1 vpp0-0.lab.ipng.ch (2001:678:d78:200::) 3.037 ms
traceroute to vpp0-1 (2001:678:d78:200::1), 30 hops max, 80 byte packets
1 vpp0-1.lab.ipng.ch (2001:678:d78:200::1) 5.125 ms
traceroute to vpp0-2 (2001:678:d78:200::2), 30 hops max, 80 byte packets
1 vpp0-2.lab.ipng.ch (2001:678:d78:200::2) 7.135 ms
traceroute to vpp0-3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
1 vpp0-3.lab.ipng.ch (2001:678:d78:200::3) 8.763 ms
```
With MPLS, each of these routers appears to the naked eye to be directly connected to the
`lab` headend machine, but we know better! :)
## What's next
I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling
to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)
Our work is mostly complete; there are two pending Gerrits which should be ready to review and
certainly ready to play with:
1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
_LIP_ interfaces and Linux sysctl for MPLS input.
1. [[Gerrit 38702](https://gerrit.fd.io/r/c/vpp/+/38702)]: This adds the ability to listen to Netlink
messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
and MPLS FIB in the VPP dataplane.
Finally, a note from your friendly neighborhood developers: this code is brand-new and has had _very
limited_ peer-review from the VPP developer community. It adds a significant feature to the Linux
Controlplane plugin, so make sure you both understand the semantics, the differences between Linux
and VPP, and the overall implementation before attempting to use in production. We're pretty sure
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
---
date: "2023-05-28T10:08:14Z"
title: VPP MPLS - Part 4
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
**Special Thanks**: Adrian _vifino_ Pistol for writing this code and for the wonderful collaboration!
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
In the last three articles, I thought I had described "all we need to know" to perform MPLS using
the Linux Controlplane in VPP:
1. In the [[first article]({% post_url 2023-05-07-vpp-mpls-1 %})] of this series, I took a look at MPLS
in general.
2. In the [[second article]({% post_url 2023-05-17-vpp-mpls-2 %})] of the series, I demonstrated a few
special case labels (such as _Explicit Null_ and _Implicit Null_, the latter of which enables the fabled
_Penultimate Hop Popping_ behavior of MPLS).
3. Then, in the [[third article]({% post_url 2023-05-21-vpp-mpls-3 %})], I worked with
[@vifino](https://chaos.social/@vifino) to implement the plumbing for MPLS in the Linux Control Plane
plugin for VPP. He did most of the work, I just watched :)
As if in a state of premonition, I mentioned:
> Caveat emptor, outside of a modest functional and load-test, this MPLS functionality
> hasn't seen a lot of mileage as it's only a few weeks old at this point, so it could definitely
> contain some rough edges. Use at your own risk, but if you did want to discuss issues, the
> [[vpp-dev@](mailto:vpp-dev@lists.fd.io)] mailinglist is a good first stop.
## Introduction
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
As a reminder, the LAB we built is running VPP with a feature added to Linux Control Plane Plugin,
which lets it consume MPLS routes and program the IPv4/IPv6 routing table as well as the MPLS
forwarding table in VPP. At this point, we are running [[Gerrit 38702, PatchSet
10](https://gerrit.fd.io/r/c/vpp/+/38702/10)].
First, let me specify the problem statement: @vifino and I both noticed that sometimes, pinging from
one VPP node to another worked fine, while SSHing did not. This article describes an issue I
diagnosed, and provided a fix for, in the Linux Controlplane plugin implementation.
### Clue 1: Intermittent ping
My first finding is that our LAB machines run _all_ the VPP plugins, notably the `ping` plugin,
which means that VPP itself was responding to ping/ping6. The Linux controlplane plugin sometimes did
not receive any traffic at all; other times it did receive the traffic, say a TCP SYN for port 22,
and dutifully responded to it, but that SYN/ACK was never seen back.
Indeed, if I disable the `ping` plugin, pinging between seemingly random pairs of `vpp0-[0123]`
no longer works, while pinging direct neighbors (eg. `vpp0-0.e1` to `vpp0-1.e0`) consistently works
well.
### Clue 2: Corrupted MPLS packets
Using the `tap0-0` virtual machine, which sees a copy of all packets on the Open vSwitch underlay in
our lab, I started tcpdumping and noticed two curious packets from time to time:
```
09:22:55.349977 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63)
version error: 4 != 6
09:23:00.357583 52:54:00:01:10:00 > 52:54:00:00:10:01, ethertype 802.1Q (0x8100), length 160: vlan 20, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 0 (IPv4 explicit NULL), tc 0, [S], ttl 61)
IP6, wrong link-layer encapsulation (invalid)
```
Looking at the payload of these broken packets, they are DNS packets coming from the Linux
Control Plane on `vpp0-3`, and they are being sent to either the IPv4 address `192.168.10.4` or the
IPv6 address `2001:678:d78:201::ffff`. Interestingly, these are the lab's resolvers, so I think
`vpp0-3` is just trying to resolve something.
### Clue 3: Vanishing MPLS packets
As I mentioned, some source/destination pairs in the lab do not seem to pass traffic, while others
are fine. One such case of _packetlo_ is any traffic from `vpp0-3` to the IPv4 address of
`vpp0-1.e0`. The path from `vpp0-3` to that IPv4 address should go out on `vpp0-3.e0` and into
`vpp0-2.e1`, but using tcpdump shows absolutely no such traffic between `vpp0-3` and `vpp0-2`,
while I'd expect to see it on VLAN 22!
## Diagnosis
Well, based on **Clue 3**, I take a look at what is happening on `vpp0-3`. I start by looking at the Linux
controlplane view, where the route to `lab` looks like this:
```
root@vpp0-3:~$ ip route get 192.168.10.4
192.168.10.4/31 nhid 154 encap mpls 36 via 192.168.10.10 dev e0 proto ospf src 192.168.10.3 metric 20
root@vpp0-3:~$ tcpdump -evni e0 mpls 36
15:07:50.864605 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype MPLS unicast (0x8847), length 136:
MPLS (label 36, tc 0, [S], ttl 64)
(tos 0x0, ttl 64, id 15752, offset 0, flags [DF], proto UDP (17), length 118)
192.168.10.3.36954 > 192.168.10.4.53: 20950+ PTR?
1.9.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa. (90)
```
Yes indeed, Linux is sending an IPv4 DNS packet out on `e0`, so what am I seeing on the switch fabric?
In the LAB diagram above, I can look up that traffic from `vpp0-3` destined to `vpp0-2` should show up
on VLAN 22:
```
root@tap0-0:~$ tcpdump -evni enp16s0f0 -s 1500 -X vlan 22 and mpls
15:19:56.453521 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0,
ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63)
version error: 4 != 6
0x0000: 0000 213f 4500 0076 d17e 4000 4011 d3a0 ..!?E..v.~@.@...
0x0010: c0a8 0a03 c0a8 0a04 e139 0035 0062 0dde .........9.5.b..
0x0020: 079e 0100 0001 0000 0000 0000 0131 0139 .............1.9
0x0030: 0130 0130 0130 0130 0130 0130 0130 0130 .0.0.0.0.0.0.0.0
0x0040: 0130 0130 0130 0130 0130 0130 0133 0130 .0.0.0.0.0.0.3.0
0x0050: 0130 0130 0138 0137 0164 0130 0138 0137 .0.0.8.7.d.0.8.7
0x0060: 0136 0130 0131 0130 0130 0132 0369 7036 .6.0.1.0.0.2.ip6
0x0070: 0461 7270 6100 000c 0001 .arpa.....
```
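As an aside, that `version error: 4 != 6` is tcpdump telling us that the label and the payload disagree: MPLS carries no protocol field, so decoders infer the payload type from the _Explicit NULL_ label value and sanity-check it against the IP version nibble, the first four bits of the payload. A small sketch of that check -- a hypothetical helper, not tcpdump's actual code:

```python
def guess_payload(label: int, payload: bytes) -> str:
    """Guess an MPLS payload type the way a decoder would (hypothetical helper).

    MPLS has no protocol field, so we infer the payload from the Explicit NULL
    label (0 = IPv4, 2 = IPv6) and cross-check the IP version nibble."""
    version = payload[0] >> 4
    expected = {0: 4, 2: 6}.get(label)
    if expected is not None and version != expected:
        return f"version error: {version} != {expected}"
    return f"IPv{version}"

# The corrupted packet above: IPv6 Explicit NULL (label 2) wrapping an
# IPv4 header, whose first byte 0x45 carries version nibble 4:
print(guess_payload(2, bytes.fromhex("4500")))  # version error: 4 != 6
```

The payload of the hexdump above starts with `4500`, a textbook IPv4 header, hiding behind an IPv6 label: exactly the mismatch tcpdump complains about.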
### MPLS Corruption
Ouch, that hurts my eyes! Linux sent an IPv4 packet into the TAP device carrying label value 36, so
why is it being observed as an _IPv6 Explicit Null_ with label value 2? That can't be right. In an
attempt to learn more, I ask VPP to give me a packet trace. I happen to remember that on the way
from Linux to VPP, the `virtio-input` node is used (while on the way from the wire to VPP,
`dpdk-input` is used).
The trace teaches me something really valuable:
```
vpp0-3# trace add virtio-input 100
vpp0-3# show trace
00:03:27:192490: virtio-input
virtio: hw_if_index 7 next-index 4 vring 0 len 136
hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:03:27:192500: ethernet-input
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:03:27:192504: mpls-input
MPLS: next mpls-lookup[1] label 36 ttl 64 exp 0
00:03:27:192506: mpls-lookup
MPLS: next [6], lookup fib index 0, LB index 92 hash 0 label 36 eos 1
00:03:27:192510: mpls-label-imposition-pipe
mpls-header:[ipv6-explicit-null:63:0:eos]
00:03:27:192512: mpls-output
adj-idx 21 : mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 flow hash: 0x00000000
00:03:27:192515: GigabitEthernet10/0/0-output
GigabitEthernet10/0/0 flags 0x00180005
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
label 2 exp 0, s 1, ttl 63
00:03:27:192517: GigabitEthernet10/0/0-tx
GigabitEthernet10/0/0 tx queue 0
buffer 0x4c2ea1: current data 0, length 136, buffer-pool 0, ref-count 1, trace handle 0x7
l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 65535, nb_segs 1, pkt_len 136
buf_len 2176, data_len 136, ol_flags 0x0, data_off 128, phys_addr 0x730ba8c0
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
label 2 exp 0, s 1, ttl 63
```
At this point, I think I've figured it out. I can see clearly that the MPLS packet is seen coming
from Linux, and it has label value 36. But, it is then offered to graph node `mpls-input`, which does
what it is designed to do, namely look up the label in the FIB:
```
vpp0-3# show mpls fib 36
MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
36:neos/21 fib:0 index:88 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[50] locks:24 flags:shared, uPRF-list:38 len:1 itfs:[1, ]
path:[66] pl-index:50 ip6 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0
[@0]: ipv6 via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:4 flags:[] 52540002100152540003100086dd
Extensions:
path:66 labels:[[ipv6-explicit-null pipe ttl:0 exp:0]]
forwarding: mpls-neos-chain
[@0]: dpo-load-balance: [proto:mpls index:91 buckets:1 uRPF:38 to:[0:0]]
[0] [@6]: mpls-label[@34]:[ipv6-explicit-null:64:0:neos]
[@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847
```
{{< image width="80px" float="left" src="/assets/vpp-mpls/lightbulb.svg" alt="Lightbulb" >}}
Haha, I love it when the brain-lightbulb flips to the _on_ position. What's happening is that we
turned on the MPLS feature on the VPP `tap` that is connected to `e0`, so when VPP saw an MPLS packet,
it looked up label 36 in the MPLS FIB to decide what to do with it, learning that it must _SWAP_ it for
_IPv6 Explicit NULL_ (which is label value 2), and send it out on Gi10/0/0 to an IPv6 nexthop. Yeah,
**that'll break all right**.
### MPLS Drops
OK, that explains the garbled packets, but what about the ones that I never even saw on the wire
(**Clue 3**)? Well, now that I've enjoyed my lightbulb moment, I know exactly where to look.
Consider the following route in Linux, which sends traffic out encapsulated with MPLS label value 37;
and consider also what happens when `mpls-input` receives an MPLS frame with that value:
```
root@vpp0-3:~# ip ro get 192.168.10.6
192.168.10.6 encap mpls 37 via 192.168.10.10 dev e0 src 192.168.10.3 uid 0
vpp0-3# show mpls fib 37
MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
```
.. that's right, there ***IS*** no entry. As such, I would expect VPP to not know what to do with
such a mislabeled packet, and drop it. Unsurprisingly at this point, here's a nice proof:
```
00:10:31:107882: virtio-input
virtio: hw_if_index 7 next-index 4 vring 0 len 102
hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:10:31:107891: ethernet-input
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:10:31:107897: mpls-input
MPLS: next mpls-lookup[1] label 37 ttl 64 exp 0
00:10:31:107898: mpls-lookup
MPLS: next [0], lookup fib index 0, LB index 22 hash 0 label 37 eos 1
00:10:31:107901: mpls-drop
drop
00:10:31:107902: error-drop
rx:tap1
00:10:31:107905: drop
mpls-input: MPLS DROP DPO
```
Conclusion: ***tadaa.wav***. When VPP receives the MPLS packet from Linux, it has already been
routed (encapsulated and put in an MPLS packet that's meant to be sent to the next router), so it
should be left alone. Instead, VPP is forcing the packet through the MPLS FIB, where if I'm lucky
(and I'm not, clearly ...) the right thing happens. But, sometimes, the MPLS FIB has instructions
that are different to what Linux had intended, bad things happen, and kittens get hurt. I can't
allow that to happen. I like kittens!
## Fixing Linux CP + MPLS
Now that I know what's actually going on, the fix comes quickly into focus. Of course, when Linux
sends an MPLS packet, VPP **must not** do a FIB lookup. Instead, it should emit the packet on the
correct interface as-is. It sounds a little bit like re-arranging the directed graph that VPP uses
internally. I've never done this before, but why not give it a go .. you know, for science :)
VPP has a concept called _feature arcs_. These are codepoints where features can be inserted and
turned on/off. There's a feature arc for MPLS called `mpls-input`. I can create a graph node that
does anything I'd like to the packets at this point, and what I want to do is take the packet and
instead of offering it to the `mpls-input` node, just emit it on its egress interface using
`interface-output`.
First, I call `VLIB_NODE_FN` which defines a new node in VPP, and I call it `lcp_xc_mpls()`. I
register this node with `VLIB_REGISTER_NODE` giving it the symbolic name `linux-cp-xc-mpls` which
extends the existing code in this plugin for ARP and IPv4/IPv6 forwarding. Once the packet enters my
new node, there are two possible places for it to go, defined by the `next_nodes` field:
1. ***LCP_XC_MPLS_NEXT_DROP***: If I can't figure out where this packet is headed (there should be
an existing adjacency for it), I will send it to `error-drop` where it will be discarded.
1. ***LCP_XC_MPLS_NEXT_IO***: If I do know, however, I ask VPP to send this packet simply to
`interface-output`, where it will be marshalled on the wire, unmodified.
Taking this shortcut for MPLS packets avoids them being looked up in the FIB, and in hindsight this
is no different from how IPv4 and IPv6 packets are also short-circuited: for those, `ip4-lookup` and
`ip6-lookup` are also not called; instead, `lcp_xc_inline()` does the business.
I can inform VPP that my new node should be attached as a feature on the `mpls-input` arc,
by calling `VNET_FEATURE_INIT` with it.
Implementing the VPP node is a bit of fiddling - but I take inspiration from the existing function
`lcp_xc_inline()` which does this for IPv4 and IPv6. Really, all I must do is two things:
1. Using the _Linux Interface Pair (LIP)_ entry, figure out which physical interface corresponds
to the TAP interface I just received the packet on, and then set the TX interface to that.
1. Retrieve the ethernet adjacency based on the destination MAC address, use it to set the correct
L2 nexthop. If I don't know what adjacency to use, set `LCP_XC_MPLS_NEXT_DROP` as the next node,
otherwise set `LCP_XC_MPLS_NEXT_IO`.
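The decision the node makes can be modeled in a few lines. This is a toy sketch of the control flow only -- the table contents are illustrative, and this is of course not the actual VPP C code:

```python
# Toy model of the decision in the linux-cp-xc-mpls node: map the TAP
# the packet arrived on to its paired hardware interface (via the LIP),
# resolve the adjacency for the destination MAC, and pick the next node.
NEXT_DROP, NEXT_IO = "error-drop", "interface-output"

lip_pairs = {"tap1": "GigabitEthernet10/0/0"}   # TAP -> phy, from the LIP
adjacencies = {"52:54:00:02:10:01": 21}         # dst MAC -> adjacency index

def lcp_xc_mpls(rx_tap: str, dst_mac: str):
    tx_if = lip_pairs[rx_tap]                   # step 1: set the TX interface
    adj = adjacencies.get(dst_mac)              # step 2: find the adjacency
    next_node = NEXT_IO if adj is not None else NEXT_DROP
    return next_node, tx_if, adj

print(lcp_xc_mpls("tap1", "52:54:00:02:10:01"))
# ('interface-output', 'GigabitEthernet10/0/0', 21)
print(lcp_xc_mpls("tap1", "ff:ff:ff:ff:ff:ff"))
# ('error-drop', 'GigabitEthernet10/0/0', None)
```

The key property is what the model does _not_ do: there is no label lookup anywhere, the packet goes out exactly as Linux built it.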
The finishing touch on the graph node is to make sure that it's trace-aware. I use packet tracing _a
lot_, as can be seen as well in this article, so I'll detect if tracing for a given packet is turned
on, and if so, tack on a `lcp_xc_trace_t` object, so traces will reveal my new node in use.
Once the node is ready, I have one final step. When constructing the _Linux Interface Pair_ in
`lcp_itf_pair_add()`, I will enable the newly created feature called `linux-cp-xc-mpls` on the
`mpls-input` feature arc for the TAP interface, by calling `vnet_feature_enable_disable()`.
Conversely, I'll disable the feature when removing the _LIP_ in `lcp_itf_pair_del()`.
## Results
After rebasing @vifino's change, I add my code in [[Gerrit 38702, PatchSet
11-14](https://gerrit.fd.io/r/c/vpp/+/38702/11..14)]. I think the simplest way to show the effect
of the change is to take a look at these MPLS packets that come in from the Linux Controlplane, and
how they now get moved into `linux-cp-xc-mpls` instead of `mpls-input` as before:
```
00:04:12:846748: virtio-input
virtio: hw_if_index 7 next-index 4 vring 0 len 102
hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1
00:04:12:846804: ethernet-input
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
00:04:12:846811: mpls-input
MPLS: next BUG![3] label 37 ttl 64 exp 0
00:04:12:846812: linux-cp-xc-mpls
lcp-xc: itf:1 adj:21
00:04:12:846844: GigabitEthernet10/0/0-output
GigabitEthernet10/0/0 flags 0x00180005
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
label 37 exp 0, s 1, ttl 64
00:04:12:846846: GigabitEthernet10/0/0-tx
GigabitEthernet10/0/0 tx queue 0
buffer 0x4be948: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
l2-hdr-offset 0 l3-hdr-offset 14
PKT MBUF: port 65535, nb_segs 1, pkt_len 102
buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1f9a5280
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01
label 37 exp 0, s 1, ttl 64
```
The same is true for the original DNS packet with MPLS label 36 -- it just transmits out on
`Gi10/0/0` with the same label, which is dope! Indeed, no more garbled MPLS packets are seen, and
the following simple acceptance test shows that all machines can reach all other machines on the LAB
cluster with both IPv4 and IPv6:
```
ipng@vpp0-3:~$ fping -g 192.168.10.0 192.168.10.3
192.168.10.0 is alive
192.168.10.1 is alive
192.168.10.2 is alive
192.168.10.3 is alive
ipng@vpp0-3:~$ fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3
2001:678:d78:200:: is alive
2001:678:d78:200::1 is alive
2001:678:d78:200::2 is alive
2001:678:d78:200::3 is alive
```
My ping test here from `vpp0-3` tries to ping (via the Linux controlplane) each of the other
routers, including itself. It first does this with IPv4, and then with IPv6, showing that all
_eight_ possible destinations are alive. Progress, sweet sweet progress.
I then expand that with this nice oneliner:
```
pim@lab:~$ for af in 4 6; do \
for node in $(seq 0 3); do \
ssh -$af ipng@vpp0-$node "fping -g 192.168.10.0 192.168.10.3; \
fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3"; \
done \
done | grep -c alive
64
```
Explanation: For both IPv4 and IPv6, I log in to all four nodes (so in total I invoke SSH eight
times), and then perform both `fping` operations, each time receiving _eight_ responses,
_sixty-four_ in total. This checks out. I am very pleased with my work.
## What's next
I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling
to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)
Our work is mostly complete; there are two pending Gerrits which should be ready to review and
certainly ready to play with:
1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
_LIP_ interfaces and Linux sysctl for MPLS input.
1. [[Gerrit 38702/10](https://gerrit.fd.io/r/c/vpp/+/38702/10)]: This adds the ability to listen to Netlink
messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
and MPLS FIB in the VPP dataplane.
1. [[Gerrit 38702/14](https://gerrit.fd.io/r/c/vpp/+/38702/11..14)]: This Gerrit now also adds the
ability to directly output MPLS packets from Linux out on the correct interface, without
pulling it through the MPLS fib.
Finally, a note from your friendly neighborhood developers: this code is brand-new and has had _very
limited_ peer-review from the VPP developer community. It adds a significant feature to the Linux
Controlplane plugin, so make sure you both understand the semantics, the differences between Linux
and VPP, and the overall implementation before attempting to use in production. We're pretty sure
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
---
date: "2023-08-06T05:35:14Z"
title: Pixelfed - Part 1 - Installing
---
# About this series
{{< image width="200px" float="right" src="/assets/pixelfed/pixelfed-logo.svg" alt="Pixelfed" >}}
I have seen companies achieve great successes in the consumer internet and entertainment industry, but I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies sometimes take away my autonomy and exert control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.
After having written a fair bit about my Mastodon [[install]({% post_url 2022-11-20-mastodon-1 %})] and
[[monitoring]({% post_url 2022-11-27-mastodon-3 %})], I've been using it every day. This morning, my buddy Ram&oacute;n asked if he could
make a second account on **ublog.tech** for his _Campervan Adventures_, notably to post pics of where he and his family went.
But if pics are your jam, why not ... [[Pixelfed](https://pixelfed.org/)]!
## Introduction
Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of
updates on your profile. Very similar to the relationship between _Facebook_ and _Instagram_, _Mastodon_ and _Pixelfed_ give the ability to
post and share, cross-link, discuss, comment and like, across the entire _Fediverse_. Except, Pixelfed doesn't do this in a centralized
way, and I get to be a steward of my own data.
As is common in the _Fediverse_, groups of people congregate on a given server, of which they become a user by creating an account on that
server. Then, they interact with one another on that server, but users can also interact with folks on _other_ servers. Instead of following
**@IPngNetworks**, they might follow a user on a given server domain, like **@IPngNetworks@pix.ublog.tech**. This way, all these servers can
be run _independently_ but interact with each other using a common protocol (called ActivityPub). I've heard this concept compared to
choosing an e-mail provider: I might choose Google's gmail.com, and you might use Microsoft's live.com. However we can send e-mails back and
forth due to this common protocol (called SMTP).
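Under the hood, that `user@server` discovery is done with WebFinger ([[RFC 7033](https://www.rfc-editor.org/rfc/rfc7033)]). As a sketch, asking a server who **@IPngNetworks@pix.ublog.tech** is amounts to a single HTTPS GET (the account here is just the example from above; the exact JSON returned varies per server):

```
pim@squanchy:~$ curl -s 'https://pix.ublog.tech/.well-known/webfinger?resource=acct:IPngNetworks@pix.ublog.tech'
```

The JSON that comes back links to the account's ActivityPub actor URL, which is how servers find each other.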
### pix.uBlog.tech
I thought I would give it a go, mostly out of engineering curiosity but also because I more strongly feel today that we (the users) ought to
take a bit more ownership back. I've been a regular blogging and micro-blogging user since approximately forever, and I think it may be a
good investment of my time to learn a bit more about the architecture of Pixelfed. So, I've decided to build and _productionize_ a server
instance.
Previously, I registered [uBlog.tech](https://ublog.tech) and have been running that for about a year as a _Mastodon_ instance.
Incidentally, if you're reading this and would like to participate, the server welcomes users in the network-, systems- and software
engineering disciplines. But before I can get to the fun parts, I have to do a bunch of work to get this server into a shape in which
it can be trusted with user generated content.
### The IPng environment
#### Pixelfed: Virtual Machine
I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and one virtio network card. For disks, I
will use two block devices, one small one of 16GB (vda) that is created on the hypervisor's `ssd-vol1/libvirt/pixelfed-disk0`, to be used only
for boot, logs and OS. Then, a second one (vdb) is created at 2TB on `vol0/pixelfed-disk1` and it will be used for Pixelfed itself.
I simply install Debian into **vda** using `virt-install`. At IPng Networks we have some ansible-style automation that takes over the
machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a
firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 because this is to be a (backend) webserver
behind the NGINX cluster.
After installing Debian Bullseye, I'll create the following ZFS filesystems on **vdb**:
```
pim@pixelfed:~$ sudo zpool create data /dev/vdb
pim@pixelfed:~$ sudo zfs create -o mountpoint=/data/pixelfed -o quota=10G data/pixelfed
pim@pixelfed:~$ sudo zfs create -o mountpoint=/data/pixelfed/pixelfed/storage data/pixelfed-storage
pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/mysql -o quota=20G data/mysql
pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/redis -o quota=2G data/redis
```
As a sidenote, I realize that this ZFS filesystem pool consists only of **vdb**, but its underlying blockdevice is protected in a raidz, and
it is copied incrementally daily off-site by the hypervisor. I'm pretty confident on safety here, but I prefer to use ZFS for the virtual
machine guests as well, because now I can do local snapshotting, of say `data/pixelfed`, and I can more easily grow/shrink the
datasets for the supporting services, as well as isolate them individually against sibling wildgrowth.
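For example (a sketch with made-up snapshot name and quota value, using the datasets created above), protecting the Pixelfed data before a risky upgrade and later growing the MySQL dataset are one-liners each:

```
pim@pixelfed:~$ sudo zfs snapshot data/pixelfed@pre-upgrade
pim@pixelfed:~$ sudo zfs rollback data/pixelfed@pre-upgrade
pim@pixelfed:~$ sudo zfs set quota=30G data/mysql
```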
The VM gets one virtual NIC, which will connect to the [[IPng Site Local]({% post_url 2023-03-17-ipng-frontends %})] network using
jumboframes. This way, the machine itself is disconnected from the internet, saving a few IPv4 addresses and allowing for the IPng NGINX
frontends to expose it. I give it the name `pixelfed.net.ipng.ch` with addresses 198.19.4.141 and 2001:678:d78:507::d, which will be
firewalled and NATed via the IPng SL gateways.
#### IPng Frontend: Wildcard SSL
I run most websites behind a cluster of NGINX webservers, which are carrying an SSL certificate which support wildcards. The system is
using [[DNS-01]({% post_url 2023-03-24-lego-dns01 %})] challenges, so the first order of business is to expand the certificate from serving
only [[ublog.tech](https://ublog.tech)] (which is in use by the companion Mastodon instance), to include as well _*.ublog.tech_ so that I can
add the new Pixelfed instance as [[pix.ublog.tech](https://pix.ublog.tech)]:
```
lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \
--work-dir /home/lego/workdir --manual \
--manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \
--preferred-challenges dns --debug-challenges \
-d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \
-d ipng.nl -d *.ipng.nl \
-d ipng.eu -d *.ipng.eu \
-d ipng.li -d *.ipng.li \
-d ublog.tech -d *.ublog.tech \
-d as8298.net \
-d as50869.net
CERT=ipng.ch
CERTFILE=/home/lego/acme-dns/live/ipng.ch/fullchain.pem
KEYFILE=/home/lego/acme-dns/live/ipng.ch/privkey.pem
MACHS="nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch"

for MACH in $MACHS; do
  fping -q $MACH 2>/dev/null || {
    echo "$MACH: Skipping (unreachable)"
    continue
  }
  echo $MACH: Copying $CERT
  scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt
  scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key
  echo $MACH: Reloading nginx
  ssh $MACH 'sudo systemctl reload nginx'
done
```
The first command here requests a certificate with `certbot`; note the addition of the flag `-d *.ublog.tech`. It'll correctly say that
there are 11 existing domains in this certificate, and ask me if I'd like to request a new cert with the 12th one added. I answer yes, and
a few seconds later, `acme-dns` has answered all of Let's Encrypt's challenges, and a certificate is issued.
The second command then distributes that certificate to the four NGINX frontends and reloads each of them. Now I can use the hostname
`pix.ublog.tech`, as far as the SSL certs are concerned. Of course, the regular certbot cronjob renews the cert regularly, so I tucked away
the second part here into a script called `bin/certbot-distribute`, using the `RENEWED_LINEAGE` variable that certbot(1) sets when using the
flag `--deploy-hook`:
```
lego@lego:~$ cat /etc/cron.d/certbot
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
0 */12 * * * lego perl -e 'sleep int(rand(43200))' && \
certbot -q renew --config-dir /home/lego/acme-dns \
--logs-dir /home/lego/logs --work-dir /home/lego/workdir \
--deploy-hook "/home/lego/bin/certbot-distribute"
```
#### IPng Frontend: NGINX
The previous `certbot-distribute` shell script has copied the certificate to four separate NGINX instances, two in Amsterdam hosted at
AS8283 (Coloclue), one in Zurich hosted at AS25091 (IP-Max), and one in Geneva hosted at AS8298 (IPng Networks). Each of these NGINX servers
has a frontend IPv4 and IPv6 address, and a backend jumboframe enabled interface in IPng Site Local (198.19.0.0/16). Because updating the
configuration on four production machines is cumbersome, I previously created an Ansible playbook, which I now add this new site to:
```
pim@squanchy:~/src/ipng-ansible$ cat roles/nginx/files/sites-available/pix.ublog.tech.conf
server {
    listen [::]:80;
    listen 0.0.0.0:80;

    server_name pix.ublog.tech;
    access_log /var/log/nginx/pix.ublog.tech-access.log;
    include /etc/nginx/conf.d/ipng-headers.inc;

    include "conf.d/lego.inc";

    location / {
        return 301 https://$host$request_uri;
    }
}

server {
    listen [::]:443 ssl http2;
    listen 0.0.0.0:443 ssl http2;
    ssl_certificate /etc/nginx/certs/ipng.ch.crt;
    ssl_certificate_key /etc/nginx/certs/ipng.ch.key;
    include /etc/nginx/conf.d/options-ssl-nginx.inc;
    ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;

    server_name pix.ublog.tech;
    access_log /var/log/nginx/pix.ublog.tech-access.log upstream;
    include /etc/nginx/conf.d/ipng-headers.inc;

    keepalive_timeout 70;
    sendfile on;
    client_max_body_size 80m;

    location / {
        proxy_pass http://pixelfed.net.ipng.ch:80;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
The configuration is straightforward. The first server block bounces all traffic destined to port 80 to its port 443 equivalent.
The second server block (listening on port 443) uses the certificate I just renewed to serve `*.ublog.tech`, which allows the cluster to
offload SSL and forward the traffic over the internal private network to the VM I created earlier.
One quick Ansible playbook run later, and the reversed proxies are ready to rock and roll:
{{< image src="/assets/pixelfed/ansible.png" alt="Ansible" >}}
Of course, this website will just timeout for the time being, because there's nothing listening (yet) on `pixelfed.net.ipng.ch:80`.
#### Installing Pixelfed
So off I go, installing Pixelfed on the new Debian VM. First, I'll install the set of Debian packages this instance will need, including
PHP 8.1 (which is the minimum supported, according to the Pixelfed docs):
```
pim@pixelfed:~$ sudo apt install apt-transport-https lsb-release ca-certificates git wget curl \
build-essential apache2 mariadb-server pngquant optipng jpegoptim gifsicle ffmpeg redis
pim@pixelfed:~$ sudo wget -O /etc/apt/trusted.gpg.d/php.gpg https://packages.sury.org/php/apt.gpg
pim@pixelfed:~$ echo "deb https://packages.sury.org/php/ $(lsb_release -sc) main" \
| sudo tee -a /etc/apt/sources.list.d/php.list
pim@pixelfed:~$ sudo apt update
pim@pixelfed:~$ sudo apt-get install php8.1-fpm php8.1 php8.1-common php8.1-cli php8.1-gd \
php8.1-mbstring php8.1-xml php8.1-bcmath php8.1-pgsql php8.1-curl php8.1-xml php8.1-xmlrpc \
php8.1-imagick php8.1-gd php8.1-mysql php8.1-cli php8.1-intl php8.1-zip php8.1-redis
```
After all those bits and bytes settle on the filesystem, I simply follow the regular [[install
guide](https://docs.pixelfed.org/running-pixelfed/installation/)] from the upstream documentation.
I update the PHP config to allow larger uploads:
```
pim@pixelfed:~$ sudo vim /etc/php/8.1/fpm/php.ini
upload_max_filesize = 100M
post_max_size = 100M
```
I create a FastCGI pool for Pixelfed:
```
pim@pixelfed:~$ cat << EOF | sudo tee /etc/php/8.1/fpm/pool.d/pixelfed.conf
[pixelfed]
user = pixelfed
group = pixelfed
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
listen = /var/run/php.pixelfed.sock
pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 20
chdir = /data/pixelfed
php_flag[display_errors] = on
php_admin_value[error_log] = /data/pixelfed/php.error.log
php_admin_flag[log_errors] = on
php_admin_value[open_basedir] = /data/pixelfed:/usr/share/:/tmp:/var/lib/php
EOF
```
I reference this pool in a simple non-SSL Apache config, after enabling the modules that Pixelfed needs:
```
pim@pixelfed:~$ cat << 'EOF' | sudo tee /etc/apache2/sites-available/pixelfed.conf
<VirtualHost *:80>
    ServerName pix.ublog.tech
    ServerAdmin pixelfed@ublog.tech
    DocumentRoot /data/pixelfed/pixelfed/public
    LogLevel debug

    <Directory /data/pixelfed/pixelfed/public>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/pixelfed.error.log
    CustomLog ${APACHE_LOG_DIR}/pixelfed.access.log combined

    <FilesMatch \.php$>
        SetHandler "proxy:unix:/var/run/php.pixelfed.sock|fcgi://localhost"
    </FilesMatch>
</VirtualHost>
EOF
```
I create a user and database, and finally download the Pixelfed source code and install the `composer` tool:
```
pim@pixelfed:~$ sudo useradd pixelfed -m -d /data/pixelfed -s /bin/bash -r -c "Pixelfed User"
pim@pixelfed:~$ sudo mysql
CREATE DATABASE pixelfed;
GRANT ALL ON pixelfed.* TO pixelfed@localhost IDENTIFIED BY '<redacted>';
exit
pim@pixelfed:~$ wget -O composer-setup.php https://getcomposer.org/installer
pim@pixelfed:~$ sudo php composer-setup.php
pim@pixelfed:~$ sudo cp composer.phar /usr/local/bin/composer
pim@pixelfed:~$ rm composer-setup.php
pim@pixelfed:~$ sudo su pixelfed
pixelfed@pixelfed:~$ git clone -b dev https://github.com/pixelfed/pixelfed.git pixelfed
pixelfed@pixelfed:~$ cd pixelfed
pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer install --no-ansi --no-interaction --optimize-autoloader
pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer update
```
With the basic installation of packages and dependencies all squared away, I'm ready to configure the instance:
```
pixelfed@pixelfed:/data/pixelfed/pixelfed$ vim .env
APP_NAME="uBlog Pixelfed"
APP_URL="https://pix.ublog.tech"
APP_DOMAIN="pix.ublog.tech"
ADMIN_DOMAIN="pix.ublog.tech"
SESSION_DOMAIN="pix.ublog.tech"
TRUST_PROXIES="*"
# Database Configuration
DB_CONNECTION="mysql"
DB_HOST="127.0.0.1"
DB_PORT="3306"
DB_DATABASE="pixelfed"
DB_USERNAME="pixelfed"
DB_PASSWORD="<redacted>"
MAIL_DRIVER=smtp
MAIL_HOST=localhost
MAIL_PORT=25
MAIL_FROM_ADDRESS="pixelfed@ublog.tech"
MAIL_FROM_NAME="uBlog Pixelfed"
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan key:generate
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan storage:link
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan migrate --force
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan import:cities
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan instance:actor
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan passport:keys
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan route:cache
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan view:cache
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan config:cache
```
Pixelfed is based on [[Laravel](https://laravel.com/)], a PHP framework for _Web Artisans_ (which, I guess, now that I run
LibreNMS, IXPManager and Pixelfed, makes me one too?). Laravel commonly uses two types of runners. The first is task queuing via a
module called _Laravel Horizon_, which uses Redis to store work items to be consumed by task workers:
```
pim@pixelfed:~$ cat << EOF | sudo tee /lib/systemd/system/pixelfed.service
[Unit]
Description=Pixelfed task queueing via Laravel Horizon
After=network.target
Requires=mariadb.service
Requires=php8.1-fpm.service
Requires=redis-server.service
Requires=apache2.service
[Service]
Type=simple
ExecStart=/usr/bin/php /data/pixelfed/pixelfed/artisan horizon
User=pixelfed
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
pim@pixelfed:~$ sudo systemctl enable --now pixelfed
```
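Horizon ships with a stock `horizon:status` artisan command, which makes for a quick sanity check that the queue workers actually came up (the output shown is the healthy case):

```
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan horizon:status
Horizon is running.
```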
The other type of runner is periodic tasks, typically configured in a crontab, like so:
```
pim@pixelfed:~$ cat << EOF | sudo tee /etc/cron.d/pixelfed
* * * * * pixelfed /usr/bin/php /data/pixelfed/pixelfed/artisan schedule:run >> /dev/null 2>&1
EOF
```
Running the _schedule:run_ module once by hand shows that it exits cleanly, so I think this is good to go, even though I'm not a huge fan of
redirecting output to `/dev/null` like that.
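If that `/dev/null` bothers you too, a variant that keeps the output in syslog instead looks like this (a sketch; `logger -t` tags the lines so they're easy to grep):

```
* * * * * pixelfed /usr/bin/php /data/pixelfed/pixelfed/artisan schedule:run 2>&1 | logger -t pixelfed-schedule
```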
I will create one `admin` user on the commandline first:
```
pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan user:create
```
And now that everything is ready, I can put the icing on the cake by enabling and starting the Apache2 webserver:
```
pim@pixelfed:~$ sudo a2enmod rewrite proxy proxy_fcgi
pim@pixelfed:~$ sudo a2ensite pixelfed
pim@pixelfed:~$ sudo systemctl restart apache2
```
{{< image src="/assets/pixelfed/fipo.png" alt="First Post" >}}
### Finishing Touches
#### File permissions
After signing up, logging in and uploading my first post (which is of a BLT sandwich and a bowl of noodles, of course),
I noticed that the permissions are overly strict, and the pictures I just uploaded are not visible. The PHP FastCGI pool is
running as user `pixelfed` while the webserver is running as user `www-data`, and the former writes files with permissions `rw-------`
and directories with `rwx------`, which doesn't seem quite right to me. So I make a small edit in `config/filesystems.php`, changing the
`0600` to `0644` and the `0700` to `0755`, after which my post is visible.
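To illustrate why the stricter mode hides uploads: a `0600` file is readable only by its owner (`pixelfed`), so Apache running as `www-data` can't serve it, while `0644` adds read permission for group and other. A small sketch, where a temp file stands in for an uploaded picture:

```shell
# Create a stand-in for an uploaded file and compare the two modes.
f=$(mktemp)
chmod 0600 "$f"
before=$(stat -c %a "$f")   # 600: only the owner may read -> www-data is denied
chmod 0644 "$f"
after=$(stat -c %a "$f")    # 644: group and other may read -> webserver can serve it
echo "$before -> $after"    # prints: 600 -> 644
```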
#### uBlog's logo
{{< image width="300px" float="right" src="/assets/pixelfed/ublog.png" alt="uBlog" >}}
Although I do like the Pixelfed logo, I wanted to keep a **ublog.tech** branding, so I replaced the `public/storage/headers/default.jpg`
with my own mountains-picture in roughly the same size. By the way, I took that picture in Grindelwald, Switzerland during a
[[serene moment]({% post_url 2021-07-26-bucketlist %})] in which I discovered why tinkering with things like this is so important to my
mental health.
#### Backups
Of course, since Ram&oacute;n is a good friend, I would not want to lose his pictures. Data integrity and durability are important to me.
It's the one thing that the commercial vendors typically do really well, and my pride prohibits me from losing data due to things like "disk
failure" or "computer broken" or "datacenter on fire".
To honor this promise, I handle backups in three main ways: zrepl(1), borg(1) and mysqldump(1).
* **VM Block Devices** are running on the hypervisor's ZFS on either the SSD pool, or the disk pool, or both. Using a tool called **zrepl(1)**
(which I described a little bit in a [[previous post]({% post_url 2022-10-14-lab-1 %})]), I create a snapshot every 12hrs on the local
blockdevice, and incrementally copy away those snapshots daily to the remote fileservers.
```
pim@hvn0.ddln0:~$ sudo cat /etc/zrepl/zrepl.yaml
jobs:
- name: snap-libvirt
  type: snap
  filesystems: {
    "ssd-vol0/libvirt<": true,
    "ssd-vol1/libvirt<": true
  }
  snapshotting:
    type: periodic
    prefix: zrepl_
    interval: 12h
  pruning:
    keep:
    - type: grid
      grid: 4x12h(keep=all) | 7x1d
      regex: "^zrepl_.*"

- type: push
  name: "push-st0-chplo0"
  filesystems: {
    "ssd-vol0/libvirt<": true,
    "ssd-vol1/libvirt<": true
  }
  connect:
    type: ssh+stdinserver
    host: st0.chplo0.net.ipng.ch
    user: root
    port: 22
    identity_file: /etc/zrepl/ssh/identity
  snapshotting:
    type: manual
  send:
    encrypted: false
  pruning:
    keep_sender:
    - type: not_replicated
    - type: last_n
      count: 10
      regex: ^zrepl_.*$ # optional
    keep_receiver:
    - type: grid
      grid: 8x12h(keep=all) | 7x1d | 6x7d
      regex: "^zrepl_.*"
```
* **Filesystem Backups**: each VM makes a daily copy of its entire filesystem using **borgbackup(1)** to a set of two remote fileservers. This way,
the important file metadata, configs for the virtual machines, and so on, are all safely stored remotely.
```
pim@pixelfed:~$ sudo mkdir -p /etc/borgmatic/ssh
pim@pixelfed:~$ sudo ssh-keygen -t ecdsa -f /etc/borgmatic/ssh/identity -C root@pixelfed.net.ipng.ch
pim@pixelfed:~$ cat << EOF | sudo tee /etc/borgmatic/config.yaml
location:
    source_directories:
        - /
    repositories:
        - u022eaebe661@st0.chbtl0.ipng.ch:borg/{fqdn}
        - u022eaebe661@st0.chplo0.ipng.ch:borg/{fqdn}
    exclude_patterns:
        - /proc
        - /sys
        - /dev
        - /run
        - /swap.img
    exclude_if_present:
        - .nobackup
        - .borgskip
storage:
    encryption_passphrase: <redacted>
    ssh_command: "ssh -i /etc/borgmatic/ssh/identity -6"
    compression: lz4
    umask: 0077
    lock_wait: 5
retention:
    keep_daily: 7
    keep_weekly: 4
    keep_monthly: 6
consistency:
    checks:
        - repository
        - archives
    check_last: 3
output:
    color: false
EOF
```
* **MySQL** has a running binary log to recover from failures/restarts, but I also run a daily mysqldump(1) operation that dumps the
database to the local filesystem, allowing for quick and painless recovery. As the dump is a regular file on the filesystem, it'll be
picked up by the filesystem backup every night as well, for long term and off-site safety.
```
pim@pixelfed:~$ sudo zfs create data/mysql-backups
pim@pixelfed:~$ cat << EOF | sudo tee /etc/cron.d/bitcron
25 5 * * * root /usr/local/bin/bitcron mysql-backup.cron
EOF
```
For my friends at AS12859 [[bit.nl](https://www.bit.nl/)], I still use `bitcron(1)` :-) For the rest of you -- bitcron is a little wrapper
written in Bash that defines a few primitives such as logging, iteration and info/warning/error/fatal levels, and then runs whatever you
define in a function called `bitcron_main()`, sending e-mail to an operator only if there are warnings or errors, and otherwise logging to
`/var/log/bitcron`. The gist of the mysql-backup bitcron is this:
```
echo "Rotating the $DESTDIR directory"
rotate 10
echo "Done (rotate)"
echo ""
echo "Creating $DESTDIR/0/ to store today's backup"
mkdir -p $DESTDIR/0 || fatal "Could not create $DESTDIR/0/"
echo "Done (mkdir)"
echo ""
echo "Fetching databases"
DBS=$(echo 'show databases' | mysql -u$MYSQLUSER -p$MYSQLPASS | egrep -v '^Database')
echo "Done (fetching DBs)"
echo ""
echo "Backing up all databases"
for DB in $DBS;
do
echo " * Database $DB"
mysqldump --single-transaction -u$MYSQLUSER -p$MYSQLPASS -a $DB | gzip -9 -c \
> $DESTDIR/0/mysqldump_$DB.gz \
|| warning "Could not dump database $DB"
done
echo "Done backing up all databases"
echo ""
```
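The `rotate` primitive used at the top isn't part of the gist; a minimal sketch of what it might do is below. This is my guess at the semantics, not bitcron's actual code: shift the backup generations `0..N-1` up by one and drop the oldest, so `$DESTDIR/0` is free for tonight's dump.

```shell
# Hypothetical stand-in for bitcron's rotate: keep the last N generations by
# shifting $DESTDIR/0 -> $DESTDIR/1 -> ... and deleting $DESTDIR/N.
DESTDIR=$(mktemp -d)

rotate() {
  max=$1
  rm -rf "$DESTDIR/$max"          # drop the oldest generation
  i=$max
  while [ "$i" -gt 0 ]; do
    prev=$((i - 1))
    [ -d "$DESTDIR/$prev" ] && mv "$DESTDIR/$prev" "$DESTDIR/$i"
    i=$prev
  done
}

# Simulate one nightly run: yesterday's dump sits in 0/, then we rotate.
mkdir -p "$DESTDIR/0"
echo dummy > "$DESTDIR/0/mysqldump_pixelfed.gz"
rotate 10
# yesterday's backup now lives in 1/, and 0/ is free for today's dump
```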
## What's next
Now that the server is up, and I have a small number of users (mostly folks I know from the tech industry), I took some time to explore
the Fediverse, reach out to friends old and new, participate in a few random discussions about food, datacenter pics and
camping trips, and fiddle with the iOS and Android apps (for now, I've settled on Vernissage, after switching my iPhone away from the
horrible HEIC format which literally nobody supports). This is going to be fun :)
Now, I think I'm ready to further productionize the experience. It's important to monitor these applications, so in an upcoming post I'll be
looking at how to do _blackbox_ and _whitebox_ monitoring on this instance.
If you're looking for a home, feel free to sign up at [https://pix.ublog.tech/](https://pix.ublog.tech/) as I'm sure that having a bit more load /
traffic on this instance will allow me to learn (and in turn, to share with others)! Of course, my Mastodon instance at
[https://ublog.tech/](https://ublog.tech) is also happy to serve.

---
date: "2023-08-27T08:56:54Z"
title: 'Case Study: NGINX + Certbot with Ansible'
---
# About this series
{{< image width="200px" float="right" src="/assets/ansible/Ansible_logo.svg" alt="Ansible" >}}
In the distant past (to be precise, in November of 2009) I wrote a little piece of automation together with my buddy Paul, called
_PaPHosting_. The goal was to be able to configure common attributes like servername, config files, webserver and DNS configs in a
consistent way, tracked in Subversion. By the way, despite this project deriving its name from its first two authors, our mutual buddy Jeroen
also started using it, wrote lots of additional cool stuff in the repo, and helped move it from Subversion to Git a few
years ago.
Michael DeHaan [[ref](https://www.ansible.com/blog/author/michael-dehaan)] founded Ansible in 2012, and by then our little _PaPHosting_
project, which was written as a set of bash scripts, had sufficiently solved our automation needs. But, as is the case with most home-grown
systems, over time I kept seeing more and more interesting features and integrations emerge, along with solid documentation and a large user
base. Eventually I had to reconsider our 1.5K LOC of Bash and ~16.5K files under maintenance, and in the end, I settled on Ansible.
```
commit c986260040df5a9bf24bef6bfc28e1f3fa4392ed
Author: Pim van Pelt <pim@ipng.nl>
Date: Thu Nov 26 23:13:21 2009 +0000
pim@squanchy:~/src/paphosting$ find * -type f | wc -l
16541
pim@squanchy:~/src/paphosting/scripts$ wc -l *push.sh funcs
132 apache-push.sh
148 dns-push.sh
92 files-push.sh
100 nagios-push.sh
178 nginx-push.sh
271 pkg-push.sh
100 sendmail-push.sh
76 smokeping-push.sh
371 funcs
1468 total
```
In a [[previous article]({% post_url 2023-03-17-ipng-frontends %})], I talked about having not one but a cluster of NGINX servers that would
each share a set of SSL certificates and pose as a reversed proxy for a bunch of websites. At the bottom of that article, I wrote:
> The main thing that's next is to automate a bit more of this. IPng Networks has an Ansible controller, which I'd like to add ...
> but considering Ansible is its whole own elaborate bundle of joy, I'll leave that for maybe another article.
**Tadaah.wav** that article is here! This is by no means an introduction or howto to Ansible. For that, please take a look at the
incomparable Jeff Geerling [[ref](https://www.jeffgeerling.com/)] and his book: [[Ansible for Devops](https://www.ansiblefordevops.com/)]. I
bought and read this book, and I highly recommend it.
## Ansible: Playbook Anatomy
The first thing I do is install four Debian Bookworm virtual machines, two in Amsterdam, one in Geneva and one in Zurich. These will be my
first group of NGINX servers, which are supposed to be my geo-distributed frontend pool. I don't do any specific configuration or
installation of packages; I just leave whatever debootstrap gives me, which is a relatively lean install with 8 vCPUs, 16GB of memory, a 20GB
boot disk and a 30GB second disk for caching and static websites.
Ansible is a simple, but powerful, server and configuration management tool (with a few other tricks up its sleeve). It consists of an
_inventory_ (the hosts I'll manage), which are put in one or more _groups_; there is a registry of _variables_ (telling me things about
those hosts and groups), and an elaborate system to run small bits of automation, called _tasks_, organized in things called _Playbooks_.
### NGINX Cluster: Group Basics
First of all, I create an Ansible _group_ called **nginx** and I add the following four freshly installed virtual machine hosts to it:
```
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee -a inventory/nodes.yml
nginx:
  hosts:
    nginx0.chrma0.net.ipng.ch:
    nginx0.chplo0.net.ipng.ch:
    nginx0.nlams1.net.ipng.ch:
    nginx0.nlams2.net.ipng.ch:
EOF
```
I have a mixture of Debian and OpenBSD machines at IPng Networks, so I will add this group **nginx** as a child to another group called
**debian**, so that I can run "common debian tasks", such as installing Debian packages that I want all of my servers to have, adding users
and their SSH key for folks who need access, installing and configuring the firewall and things like Borgmatic backups.
I'm not going to go into all the details here for the **debian** playbook, though. It's just there to make the base system consistent across
all servers (bare metal or virtual). The one thing I'll mention though, is that the **debian** playbook will see to it that the correct
users are created, with their SSH pubkey, and I'm going to first use this feature by creating two users:
1. `lego`: As I described in a [[post on DNS-01]({% post_url 2023-03-24-lego-dns01 %})], IPng has a certificate machine that answers Let's
   Encrypt DNS-01 challenges; its job is to regularly prove ownership of my domains and then request a (wildcard!) certificate.
   Once a certificate renews, it copies it to all NGINX machines. To do that copy, `lego` needs an account on these machines, and it needs
   to be able to write the certs and issue a reload to the NGINX server.
1. `drone`: Most of my websites are static, for example `ipng.ch` is generated by Jekyll. I typically write an article on my laptop, and
once I'm happy with it, I'll git commit and push it, after which a _Continuous Integration_ system called [[Drone](https://drone.io)]
gets triggered, builds the website, runs some tests, and ultimately copies it out to the NGINX machines. Similar to the first user,
this second user must have an account and the ability to write its web data to the NGINX server in the right spot.
That explains the following:
```yaml
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee group_vars/nginx.yml
---
users:
  lego:
    comment: Lets Encrypt
    password: "!"
    groups: [ lego ]
  drone:
    comment: Drone CI
    password: "!"
    groups: [ www-data ]
sshkeys:
  lego:
    - key: ecdsa-sha2-nistp256 <hidden>
      comment: lego@lego.net.ipng.ch
  drone:
    - key: ecdsa-sha2-nistp256 <hidden>
      comment: drone@git.net.ipng.ch
EOF
```
I note that the `users` and `sshkeys` used here are dictionaries, and that the `users` role defines a few default accounts like my own
account `pim`, so writing this to the **group_vars** means that these new entries are applied to all machines that belong to the group
**nginx**, so they'll get these users created _in addition to_ the other users in the dictionary. Nifty!
### NGINX Cluster: Config
I wanted to be able to conserve IP addresses, and just a few months ago, had a discussion with some folks at Coloclue where we shared the
frustration that what was hip in the 90s (go to RIPE NCC and ask for a /20, justifying that with "I run SSL websites") is somehow still
being used today, even though that's no longer required, or in fact, desirable. So I take one IPv4 and IPv6 address and will use a TLS
extension called _Server Name Indication_ or [[SNI](https://en.wikipedia.org/wiki/Server_Name_Indication)], designed in 2003 (**20 years
old today**), which you can see described in [[RFC 3546](https://datatracker.ietf.org/doc/html/rfc3546)].
Folks who try to argue they need multiple IPv4 addresses because they run multiple SSL websites are somewhat of a trigger to me, so this
article doubles up as a "how to do SNI and conserve IPv4 addresses".
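A handy way to see SNI in action against a single address is to ask the same frontend for two different certificates; which one comes back depends purely on the `-servername` you send. A sketch (hostname and domains taken from this article):

```
pim@squanchy:~$ openssl s_client -connect nginx0.nlams1.net.ipng.ch:443 -servername ipng.ch \
    </dev/null 2>/dev/null | openssl x509 -noout -subject
pim@squanchy:~$ openssl s_client -connect nginx0.nlams1.net.ipng.ch:443 -servername frys-ix.net \
    </dev/null 2>/dev/null | openssl x509 -noout -subject
```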
I will group my websites that share the same SSL certificate, and I'll call these things _clusters_. An IPng NGINX Cluster:
* is identified by a name, for example `ipng` or `frysix`
* is served by one or more NGINX servers, for example `nginx0.chplo0.ipng.ch` and `nginx0.nlams1.ipng.ch`
* serves one or more distinct websites, for example `www.ipng.ch` and `nagios.ipng.ch` and `go.ipng.ch`
* has exactly one SSL certificate, which should cover all of the website(s), preferably using wildcard certs, for example `*.ipng.ch,
ipng.ch`
And then, I define several clusters this way, in the following configuration file:
```yaml
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee vars/nginx.yml
---
nginx:
  clusters:
    ipng:
      members: [ nginx0.chrma0.net.ipng.ch, nginx0.chplo0.net.ipng.ch, nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ]
      ssl_common_name: ipng.ch
      sites:
        ipng.ch:
        nagios.ipng.ch:
        go.ipng.ch:
    frysix:
      members: [ nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ]
      ssl_common_name: frys-ix.net
      sites:
        frys-ix.net:
EOF
```
This way I can neatly group the websites (e.g. the **ipng** websites) together, call them by name, and immediately see which servers are going to
be serving them using which certificate common name. For future expansion (hint: an upcoming article on monitoring), I decide to make the
**sites** element here a _dictionary_ with only keys and no values, as opposed to a _list_, because later I will want to add some bits and
pieces of information for each website.
### NGINX Cluster: Sites
As is common with NGINX, I will keep a list of websites in the directory `/etc/nginx/sites-available/` and once I need a given machine to
actually serve that website, I'll symlink it from `/etc/nginx/sites-enabled/`. In addition, I decide to add a few common configuration
snippets, such as logging and SSL/TLS parameter files and options, which allow the webserver to score relatively high on SSL certificate
checker sites. It helps to keep the security buffs off my case.
So I decide on the following structure, each file to be copied to all nginx machines in `/etc/nginx/`:
```
roles/nginx/files/conf.d/http-log.conf
roles/nginx/files/conf.d/ipng-headers.inc
roles/nginx/files/conf.d/options-ssl-nginx.inc
roles/nginx/files/conf.d/ssl-dhparams.inc
roles/nginx/files/sites-available/ipng.ch.conf
roles/nginx/files/sites-available/nagios.ipng.ch.conf
roles/nginx/files/sites-available/go.ipng.ch.conf
roles/nginx/files/sites-available/go.ipng.ch.htpasswd
roles/nginx/files/sites-available/...
```
In order:
* `conf.d/http-log.conf` defines a custom logline type called `upstream` that contains a few interesting additional items that show me
the performance of NGINX:
> log_format upstream '$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent" ' 'rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time';
* `conf.d/ipng-headers.inc` adds a header served to end-users from this NGINX, that reveals the instance that served the request.
Debugging a cluster becomes a lot easier if you know which server served what:
> add_header X-IPng-Frontend $hostname always;
* `conf.d/options-ssl-nginx.inc` and `conf.d/ssl-dhparams.inc` are files borrowed from Certbot's NGINX configuration, and ensure the best
TLS and SSL session parameters are used.
* `sites-available/*.conf` are the configuration blocks for the port-80 (HTTP) and port-443 (HTTPS) websites. In the interest of
brevity I won't copy them here, but if you're curious I showed a bunch of these in a [[previous article]({% post_url
2023-03-17-ipng-frontends %})]. These per-website config files sensibly include the SSL defaults, custom IPng headers and `upstream` log
format.
### NGINX Cluster: Let's Encrypt
I figure the single most important thing to get right is how to enable multiple groups of websites, including SSL certificates, in multiple
_Clusters_ (say `ipng` and `frysix`), to be served using different SSL certificates, but on the same IPv4 and IPv6 address, using _Server
Name Indication_ or SNI. Let's first take a look at building two of these certificates, one for [[IPng Networks](https://ipng.ch)] and
one for [[FrysIX](https://frys-ix.net/)], the internet exchange with Frysian roots, which incidentally offers free 1G, 10G, 40G and 100G
ports all over the Amsterdam metro. My buddy Arend and I are running that exchange, so please do join it!
I described the usual `HTTP-01` certificate challenge a while ago in [[this article]({% post_url 2023-03-17-ipng-frontends %})], but I
rarely use it because I've found that once installed, `DNS-01` is vastly superior. I wrote about the ability to request a single certificate
with multiple _wildcard_ entries in a [[DNS-01 article]({% post_url 2023-03-24-lego-dns01 %})], so I'm going to save you the repetition, and
simply use `certbot`, `acme-dns` and the `DNS-01` challenge type, to request the following _two_ certificates:
```bash
lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \
--work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \
--preferred-challenges dns --debug-challenges \
-d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \
-d ipng.nl -d *.ipng.nl \
-d ipng.eu -d *.ipng.eu \
-d ipng.li -d *.ipng.li \
-d ublog.tech -d *.ublog.tech \
-d as8298.net -d *.as8298.net \
-d as50869.net -d *.as50869.net
lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \
--work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \
--preferred-challenges dns --debug-challenges \
-d frys-ix.net -d *.frys-ix.net
```
First off, while I showed how to get these certificates by hand, generating these two commands is easily done in Ansible (which
I'll show at the end of this article!). I've already defined which cluster has which main certificate name, and which websites it serves.
Looking at `vars/nginx.yml`, it quickly becomes obvious how I can automate this. Using a relatively straightforward construct, I can let
Ansible build the list of commandline arguments for me programmatically:
1. Initialize a variable `CERT_ALTNAMES` as a list of `nginx.clusters.ipng.ssl_common_name` and its wildcard, in other words `[ipng.ch,
*.ipng.ch]`.
1. As a convenience, tack onto the `CERT_ALTNAMES` list any entries in the `nginx.clusters.ipng.ssl_altname`, such as `[*.net.ipng.ch]`.
1. Then looping over each entry in the `nginx.clusters.ipng.sites` dictionary, use `fnmatch` to match it against any entries in the
`CERT_ALTNAMES` list:
* If it matches, for example with `go.ipng.ch`, skip and continue. This website is covered already by an altname.
* If it doesn't match, for example with `ublog.tech`, simply add it and its wildcard to the `CERT_ALTNAMES` list: `[ublog.tech, *.ublog.tech]`.
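The steps above can be sketched in plain Python (an illustration, not the actual Ansible code; the cluster data mirrors the shape of `vars/nginx.yml`):

```python
import fnmatch

def build_cert_altnames(cluster):
    """Accumulate certificate altnames for one cluster, mirroring the steps above."""
    cn = cluster["ssl_common_name"]
    altnames = [cn, "*." + cn]                        # step 1: common name plus its wildcard
    altnames.extend(cluster.get("ssl_altname", []))   # step 2: tack on any extra altnames
    for sitename in cluster.get("sites", {}):         # step 3: check each site
        if any(sitename == a or fnmatch.fnmatch(sitename, a) for a in altnames):
            continue                                  # already covered by an altname
        altnames.extend([sitename, "*." + sitename])  # new domain: add it and its wildcard
    return altnames

cluster = {
    "ssl_common_name": "ipng.ch",
    "ssl_altname": ["*.net.ipng.ch"],
    "sites": {"ipng.ch": {}, "go.ipng.ch": {}, "ublog.tech": {}},
}
print(build_cert_altnames(cluster))
# → ['ipng.ch', '*.ipng.ch', '*.net.ipng.ch', 'ublog.tech', '*.ublog.tech']
```

Note how `go.ipng.ch` never shows up: it already matches the `*.ipng.ch` wildcard, while `ublog.tech` does not and so gets its own pair of entries.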
Now, the first time I run this for a new cluster (which has never had a certificate issued before), `certbot` will ask me to ensure the correct
`_acme-challenge` records are in each respective DNS zone. After doing that, it will issue two separate certificates and install a cronjob
that periodically checks their age and renews the certificate(s) when they are up for renewal. As a post-renewal (deploy) hook, I add a
script that copies the new certificate to the NGINX cluster (using the `lego` user and SSH key that I defined above).
```bash
lego@lego:~$ find /home/lego/acme-dns/live/ -type f
/home/lego/acme-dns/live/README
/home/lego/acme-dns/live/frys-ix.net/README
/home/lego/acme-dns/live/frys-ix.net/chain.pem
/home/lego/acme-dns/live/frys-ix.net/privkey.pem
/home/lego/acme-dns/live/frys-ix.net/cert.pem
/home/lego/acme-dns/live/frys-ix.net/fullchain.pem
/home/lego/acme-dns/live/ipng.ch/README
/home/lego/acme-dns/live/ipng.ch/chain.pem
/home/lego/acme-dns/live/ipng.ch/privkey.pem
/home/lego/acme-dns/live/ipng.ch/cert.pem
/home/lego/acme-dns/live/ipng.ch/fullchain.pem
```
The crontab entry that Certbot normally installs makes some assumptions about directories and about which user runs the renewal. I am not
a fan of having the `root` user do this, so I've changed it to this:
```bash
lego@lego:~$ cat /etc/cron.d/certbot
0 */12 * * * lego perl -e 'sleep int(rand(43200))' && certbot -q renew \
--config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \
--work-dir /home/lego/workdir \
--deploy-hook "/home/lego/bin/certbot-distribute"
```
And some pretty cool magic happens with this `certbot-distribute` script. When `certbot` has successfully received a new
certificate, it'll set a few environment variables and execute the deploy hook with them:
* ***RENEWED_LINEAGE***: will point to the config live subdirectory (eg. `/home/lego/acme-dns/live/ipng.ch`) containing the new
certificates and keys
* ***RENEWED_DOMAINS*** will contain a space-delimited list of renewed certificate domains (eg. `ipng.ch *.ipng.ch *.net.ipng.ch`)
Using the first of those two variables, it becomes straightforward to distribute the new certs:
```bash
#!/bin/sh
CERT=$(basename $RENEWED_LINEAGE)
CERTFILE=$RENEWED_LINEAGE/fullchain.pem
KEYFILE=$RENEWED_LINEAGE/privkey.pem
if [ "$CERT" = "ipng.ch" ]; then
MACHS="nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch"
elif [ "$CERT" = "frys-ix.net" ]; then
MACHS="nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch"
else
echo "Unknown certificate $CERT, do not know which machines to copy to"
exit 3
fi
for MACH in $MACHS; do
fping -q $MACH 2>/dev/null || {
echo "$MACH: Skipping (unreachable)"
continue
}
echo $MACH: Copying $CERT
scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt
scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key
echo $MACH: Reloading nginx
ssh $MACH 'sudo systemctl reload nginx'
done
```
There are a few things to note, if you look at my little shell script. I already kind of know which `CERT` belongs to which `MACHS`,
because this was configured in `vars/nginx.yml`, where I have a cluster name, say `ipng`, which conveniently has two variables, one called
`members` which is a list of machines, and the second is `ssl_common_name` which is `ipng.ch`. I think that I can find a way to let
Ansible generate this file for me also, whoot!
### Ansible: NGINX
Tying it all together (frankly, a tiny bit surprised you're still reading this!), I can now offer an Ansible role that automates all of this.
```yaml
{%- raw %}
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/tasks/main.yml
- name: Install Debian packages
ansible.builtin.apt:
update_cache: true
pkg: [ nginx, ufw, net-tools, apache2-utils, mtr-tiny, rsync ]
- name: Copy config files
ansible.builtin.copy:
src: "{{ item }}"
dest: "/etc/nginx/"
owner: root
group: root
mode: u=rw,g=r,o=r
directory_mode: u=rwx,g=rx,o=rx
loop: [ conf.d, sites-available ]
notify: Reload nginx
- name: Add cluster
ansible.builtin.include_tasks:
file: cluster.yml
loop: "{{ nginx.clusters | dict2items }}"
loop_control:
label: "{{ item.key }}"
EOF
pim@squanchy:~/src/ipng-ansible$ cat << EOF > roles/nginx/handlers/main.yml
- name: Reload nginx
ansible.builtin.service:
name: nginx
state: reloaded
EOF
{% endraw %}
```
The first task installs the Debian packages I'll want to use. The `apache2-utils` package is used to create and maintain `htpasswd` files,
among other useful things. The `rsync` package is needed to accept both website data from the `drone` continuous integration user and
certificate data from the `lego` user.
The second task copies all of the (static) configuration files onto the machine, populating `/etc/nginx/conf.d/` and
`/etc/nginx/sites-available/`. It uses a `notify` stanza to make note if any of these files (notably the ones in `conf.d/`) have changed, and
if so, remember to invoke a _handler_ to reload the running NGINX to pick up those changes later on.
Finally, the third task branches out and executes the tasks defined in `tasks/cluster.yml` once for each NGINX cluster (in my case, `ipng`
and then `frysix`):
```yaml
{%- raw %}
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/tasks/cluster.yml
- name: "Enable sites for cluster {{ item.key }}"
ansible.builtin.file:
src: "/etc/nginx/sites-available/{{ sites_item.key }}.conf"
dest: "/etc/nginx/sites-enabled/{{ sites_item.key }}.conf"
owner: root
group: root
state: link
loop: "{{ (nginx.clusters[item.key].sites | default({}) | dict2items) }}"
when: inventory_hostname in nginx.clusters[item.key].members | default([])
loop_control:
loop_var: sites_item
label: "{{ sites_item.key }}"
notify: Reload nginx
EOF
{% endraw %}
```
This task is a bit more complicated, so let me go over it from the outside in. The task that called us already has a loop variable
called `item`, which has a key (`ipng`) and a value (the whole cluster defined under `nginx.clusters.ipng`). Now if I take that
`item.key` variable and look at its `sites` dictionary (in other words: `nginx.clusters.ipng.sites`), I can create another loop over all the
sites belonging to that cluster. Iterating over a dictionary in Ansible is done with a filter called `dict2items`, and because technically
the cluster could have zero sites, I ensure the `sites` dictionary defaults to the empty dictionary `{}`. Phew!
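For intuition, `dict2items` turns a dictionary into a list of key/value pairs, roughly like this Python equivalent (a sketch, not Ansible's actual implementation):

```python
def dict2items(d):
    """Approximation of Ansible's dict2items filter: each key/value pair in the
    dictionary becomes a list element with 'key' and 'value' fields."""
    return [{"key": k, "value": v} for k, v in d.items()]

# A sites dictionary like the one in vars/nginx.yml:
sites = {"ipng.ch": {}, "nagios.ipng.ch": {}, "go.ipng.ch": {}}
for item in dict2items(sites):
    print(item["key"])  # in the playbook, this is what sites_item.key refers to
```

This is why the inner loop can address each website as `sites_item.key` while still having its (for now empty) value available.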
Ansible runs this for each machine, and of course I only want to execute this block if the given machine (which is referenced as
`inventory_hostname`) occurs in the cluster's `members` list. If not: skip, if yes: go! That is what the `when` line does.
The loop itself then runs for each site in the `sites` dictionary, allowing the `loop_control` to give that loop variable a unique name
called `sites_item`, and when printing information on the CLI, using the `label` set to the `sites_item.key` variable (eg. `frys-ix.net`)
rather than the whole dictionary belonging to it.
With all of that said, the inner loop is easy: create a (sym)link for each website config file from `sites-available` to `sites-enabled` and
if new links are created, invoke the _Reload nginx_ handler.
### Ansible: Certbot
***But what about that LEGO stuff?*** Fair question. The two scripts I described above (one to create the certbot certificate, and another
to copy it to the correct machines), both need to be generated and copied to the right places, so here I go, appending to the tasks:
```yaml
{%- raw %}
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee -a roles/nginx/tasks/main.yml
- name: Create LEGO directory
ansible.builtin.file:
path: "/etc/nginx/certs/"
owner: lego
group: lego
mode: u=rwx,g=rx,o=
- name: Add sudoers.d
ansible.builtin.copy:
src: sudoers
dest: "/etc/sudoers.d/lego-ipng"
owner: root
group: root
- name: Generate Certbot Distribute script
delegate_to: lego.net.ipng.ch
run_once: true
ansible.builtin.template:
src: certbot-distribute.j2
dest: "/home/lego/bin/certbot-distribute"
owner: lego
group: lego
mode: u=rwx,g=rx,o=
- name: Generate Certbot Cluster scripts
delegate_to: lego.net.ipng.ch
run_once: true
ansible.builtin.template:
src: certbot-cluster.j2
dest: "/home/lego/bin/certbot-{{ item.key }}"
owner: lego
group: lego
mode: u=rwx,g=rx,o=
loop: "{{ nginx.clusters | dict2items }}"
EOF
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/files/sudoers
## *** Managed by IPng Ansible ***
#
%lego ALL=(ALL) NOPASSWD: /usr/bin/systemctl reload nginx
EOF
{% endraw -%}
```
The first task creates `/etc/nginx/certs` which will be owned by the user `lego`, and that's where Certbot will rsync the certificates after
renewal. The second task then allows the `lego` user to issue a `systemctl reload nginx` so that NGINX can pick up the certificates once
they've changed on disk.
The third task generates the `certbot-distribute` script, which, depending on the common name of the certificate (for example `ipng.ch` or
`frys-ix.net`), knows which NGINX machines to copy it to. Its logic is pretty similar to the plain-old shellscript I started with, but it
does have a few variable expansions. If you'll recall, that script had a hard-coded way to assemble the `MACHS` variable, which can now be replaced:
```bash
{%- raw %}
# ...
{% for cluster_name, cluster in nginx.clusters.items() | default({}) %}
{% if not loop.first%}el{% endif %}if [ "$CERT" = "{{ cluster.ssl_common_name }}" ]; then
MACHS="{{ cluster.members | join(' ') }}"
{% endfor %}
else
echo "Unknown certificate $CERT, do not know which machines to copy to"
exit 3
fi
{% endraw %}
```
One common Ansible trick here is to detect whether a given loop has just begun (in which case `loop.first` will be true) or has reached its
last element (in which case `loop.last` will be true). I use this to emit the `if` (first) versus `elif` (not first) statements.
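The same first-versus-rest logic is easy to demonstrate outside Jinja2; here is a small Python sketch that emits the if/elif chain from a cluster dictionary (illustrative only, not the template itself):

```python
def emit_machs_chain(clusters):
    """Emit the if/elif chain the template produces: 'if' for the first
    cluster (Jinja2's loop.first), 'elif' for every subsequent one."""
    lines = []
    for i, cluster in enumerate(clusters.values()):
        keyword = "if" if i == 0 else "elif"  # the loop.first equivalent
        lines.append('%s [ "$CERT" = "%s" ]; then' % (keyword, cluster["ssl_common_name"]))
        lines.append('  MACHS="%s"' % " ".join(cluster["members"]))
    return "\n".join(lines)

clusters = {
    "ipng":   {"ssl_common_name": "ipng.ch",
               "members": ["nginx0.nlams1.ipng.ch", "nginx0.nlams2.ipng.ch"]},
    "frysix": {"ssl_common_name": "frys-ix.net",
               "members": ["nginx0.nlams2.ipng.ch"]},
}
print(emit_machs_chain(clusters))
```

The first cluster yields an `if` line, every later one an `elif`, exactly matching the shape of the hand-written script above.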
Looking back at what I wrote in this _Certbot Distribute_ task, you'll see I used two additional configuration elements:
1. ***run_once***: Since there are potentially many machines in the **nginx** _Group_, by default Ansible will run this task for each machine. However, the Certbot cluster and distribute scripts really only need to be generated once per _Playbook_ execution, which is determined by this `run_once` field.
1. ***delegate_to***: This task should be executed not on an NGINX machine, rather instead on the `lego.net.ipng.ch` machine, which is specified by the `delegate_to` field.
#### Ansible: lookup example
And now for the _pièce de résistance_: the fourth and final task generates a shell script that captures, for each cluster, the primary name
(called `ssl_common_name`) and the list of alternate names, which together turn into the full commandline to request a certificate with all
wildcard domains added (eg. `ipng.ch` and `*.ipng.ch`). To do this, I decide to create an Ansible [[Lookup
Plugin](https://docs.ansible.com/ansible/latest/plugins/lookup.html)]. This lookup will simply return **true** if a given sitename is
covered by any of the existing certificate altnames, including wildcard domains, for which I can use the standard Python `fnmatch`.
First, I create the lookup plugin in a well-known directory, so Ansible can discover it:
```
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/lookup_plugins/altname_match.py
import ansible.utils as utils
import ansible.errors as errors
from ansible.plugins.lookup import LookupBase
import fnmatch
class LookupModule(LookupBase):
def __init__(self, basedir=None, **kwargs):
self.basedir = basedir
def run(self, terms, variables=None, **kwargs):
sitename = terms[0]
cert_altnames = terms[1]
for altname in cert_altnames:
if sitename == altname:
return [True]
if fnmatch.fnmatch(sitename, altname):
return [True]
return [False]
EOF
```
The Python class here compares the website name in `terms[0]` with a list of altnames given in
`terms[1]`, and returns True either if a literal match occurred, or if the altname `fnmatch`es the sitename.
It returns False otherwise. Dope! Here's how I use it in the `certbot-cluster` script, which is
starting to get pretty fancy:
```bash
{%- raw %}
pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/templates/certbot-cluster.j2
#!/bin/sh
###
### {{ ansible_managed }}
###
{% set cluster_name = item.key %}
{% set cluster = item.value %}
{% set sites = nginx.clusters[cluster_name].sites | default({}) %}
#
# This script generates a certbot commandline to initialize (or re-initialize) a given certificate for an NGINX cluster.
#
### Metadata for this cluster:
#
# {{ cluster_name }}: {{ cluster }}
{% set cert_altname = [ cluster.ssl_common_name, '*.' + cluster.ssl_common_name ] %}
{% do cert_altname.extend(cluster.ssl_altname|default([])) %}
{% for sitename, site in sites.items() %}
{% set altname_matched = lookup('altname_match', sitename, cert_altname) %}
{% if not altname_matched %}
{% do cert_altname.append(sitename) %}
{% do cert_altname.append("*."+sitename) %}
{% endif %}
{% endfor %}
# CERT_ALTNAME: {{ cert_altname | join(' ') }}
#
###
certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs --work-dir /home/lego/workdir \
--manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \
--preferred-challenges dns --debug-challenges \
{% for domain in cert_altname %}
-d {{ domain }}{% if not loop.last %} \{% endif %}
{% endfor %}
EOF
{% endraw %}
```
Ansible provides a lot of templating and logic evaluation in its Jinja2 templating language, but it isn't really a programming language.
That said, from the top, here's what happens:
* I set three variables, `cluster_name`, `cluster` (the dictionary with the cluster config) and as a shorthand `sites` which is a
dictionary of sites, defaulting to `{}` if it doesn't exist.
* I'll print the cluster name and the cluster config for posterity. Who knows, eventually I'll be debugging this anyway :-)
* Then comes the main thrust, the simple loop that I described above, but in Jinja2:
* Initialize the `cert_altname` list with the `ssl_common_name` and its wildcard variant, optionally extending it with the list
of altnames in `ssl_altname`, if it's set.
* For each site in the sites dictionary, invoke the lookup and capture its (boolean) result in `altname_matched`.
* If the match failed, we have a new domain, so add it and its wildcard variant to the `cert_altname` list. I use the `do`
Jinja2 extension there, which comes from the package `jinja2.ext.do`.
* At the end of this, all of these website names have been reduced to their domain+wildcard variant, which I can loop over to emit
the `-d` flags to `certbot` at the bottom of the file.
And with that, I can generate both the certificate request command, and distribute the resulting
certificates to those NGINX servers that need them.
## Results
{{< image src="/assets/ansible/ansible-run.png" alt="Ansible Run" >}}
I'm very pleased with the results. I can clearly see that the two servers that I assigned to this
NGINX cluster (the two in Amsterdam) got their sites enabled, whereas the other two (Zurich and
Geneva) were skipped. I can also see that the new certbot request script was generated and the
existing certbot-distribute script was updated (so it knows where to copy a renewed cert for this
cluster). And, in the end, only the two relevant NGINX servers were reloaded, reducing overall risk.
One other way to show that the very same IPv4 and IPv6 address can be used to serve multiple
distinct multi-domain/wildcard SSL certificates, using this _Server Name Indication_ (SNI, which, I
repeat, has been available **since 2003** or so), is this:
```bash
pim@squanchy:~$ HOST=nginx0.nlams1.ipng.ch
pim@squanchy:~$ PORT=443
pim@squanchy:~$ SERVERNAME=www.ipng.ch
pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME </dev/null 2>/dev/null \
| openssl x509 -text | grep DNS: | sed -e 's,^ *,,'
DNS:*.ipng.ch, DNS:*.ipng.eu, DNS:*.ipng.li, DNS:*.ipng.nl, DNS:*.net.ipng.ch, DNS:*.ublog.tech,
DNS:as50869.net, DNS:as8298.net, DNS:ipng.ch, DNS:ipng.eu, DNS:ipng.li, DNS:ipng.nl, DNS:ublog.tech
pim@squanchy:~$ SERVERNAME=www.frys-ix.net
pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME </dev/null 2>/dev/null \
| openssl x509 -text | grep DNS: | sed -e 's,^ *,,'
DNS:*.frys-ix.net, DNS:frys-ix.net
```
Ansible is really powerful, and now that I've gotten to know it a little bit, I'll readily admit
it's way cooler than PaPhosting ever was :)
## What's Next
If you remember, I wrote that `nginx.clusters.*.sites` would not be a list but rather a
dictionary, because I'd like to be able to carry other bits of information. And if you take a close
look at my screenshot above, you'll see I revealed something about Nagios... so in an upcoming post
I'd like to share how IPng Networks arranges its Nagios environment, and I'll use the NGINX configs
here to show how I automatically monitor all servers participating in an NGINX _Cluster_, both for
pending certificate expiry (which generally shouldn't happen, precisely because of the automation
here) and in case any backend server takes the day off.
Stay tuned! Oh, and if you're good at Ansible and would like to point out the silly ways in which I
approach things, please do drop me a line on Mastodon, where you can reach me on
[[@IPngNetworks@ublog.tech](https://ublog.tech/@IPngNetworks)].
---
date: "2023-10-21T11:35:14Z"
title: VPP IXP Gateway - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
There are some really fantastic features in VPP, some of which are less well known, and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.
In this first article, I'll take a closer look at three things that would make such a gateway
possible: bridge domains, MAC address filtering and traffic shaping.
## Introduction
Internet Exchanges are typically L2 (ethernet) switch platforms that allow their connected members
to exchange traffic amongst themselves. Not all members share physical locations with the Internet
Exchange itself, for example the IXP may be at NTT Zurich, but the member may be present in
Interxion Zurich. For smaller clubs, like IPng Networks, it's not always financially feasible (or
desirable) to order a dark fiber between two adjacent datacenters, or even a cross connect in the
same datacenter (as many of them are charging exorbitant fees for what is essentially passive fiber
optics and patch panels), if the amount of traffic passed is modest.
One solution to such problems is to have one member transport multiple end-user downstream members
to the platform, for example by means of an Ethernet over MPLS or VxLAN transport from where the
end-user lives, to the physical port of the Internet Exchange. These transport members are often
called _IXP Resellers_, noting that usually, but not always, some form of payment is required.
From the point of view of the IXP, it's often the case that there is a one MAC address per member
limitation, and not all members will have the same bandwidth guarantees. Many IXPs will offer
physical connection speeds (like a Gigabit, TenGig or HundredGig port), but they _also_ have a
common practice to limit the passed traffic by means of traffic shaping, for example one might have
a TenGig port but only entitled to pass 3.0 Gbit/sec of traffic in- and out of the platform.
For a long time I thought this kind of sucked; after all, who wants to connect to an internet
exchange point only to see their traffic rate limited? But if you think about it, this is often to
protect the member, the reseller, and the exchange itself: the total downstream bandwidth to the
reseller's end-users is potentially larger than the reseller's port to the exchange, and this is
almost certainly the case in the other direction, where the total IXP bandwidth that might go to one
individual member is significantly larger than the reseller's port to the exchange.
Due to these two issues, a reseller port may become a bottleneck and _packetlo_ may occur. To
protect the ecosystem, having the internet exchange try to enforce fairness and bandwidth limits
makes operational sense.
## VPP as an IXP Gateway
{{< image width="400px" float="right" src="/assets/vpp-ixp-gateway/VPP IXP Gateway.svg" alt="VPP IXP Gateway" >}}
Here's a few requirements that may be necessary to provide an end-to-end solution:
1. Downstream ports MAY be _untagged_, or _tagged_, in which case encapsulation (for example
.1q VLAN tags) SHOULD be provided, one per downstream member.
1. Each downstream member MUST ONLY be allowed to send traffic from one or more registered MAC
addresses, in other words, strict filtering MUST be applied by the gateway.
1. If a downstream member is assigned an up- and downstream bandwidth limit, this MUST be
enforced by the gateway.
Of course, all sorts of other things come to mind -- perhaps MPLS encapsulation, or VxLAN/GENEVE
tunneling endpoints, and certainly some monitoring with SNMP or Prometheus, and how about just
directly integrating this gateway with [[IXPManager](https://www.ixpmanager.org/)] while we're at
it. Yes, yes! But for this article, I'm going to stick to the bits and pieces regarding VPP itself,
and leave the other parts for another day!
First, I build a quick lab out of this, by taking one supermicro bare metal server with VPP (it will
be the VPP IXP Gateway), and a couple of Debian servers and switches to simulate clients (A-J):
* Client A-D (on port `e0`-`e3`) will use `192.0.2.1-4/24` and `2001:db8::1-4/64`
* Client E-G (on switch port `e0`-`e2` of switch0, behind port `xe0`) will use `192.0.2.5-7/24` and
`2001:db8::5-7/64`
* Client H-J (on switch port `e0`-`e2` of switch1, behind port `xe1`) will use `192.0.2.8-10/24` and
`2001:db8::8-a/64`
* There will be a server attached to port `xxv0` with address `192.0.2.254/24` and `2001:db8::ff/64`
* The server will run `iperf3`.
### VPP: Bridge Domains
The fundamental topology described in the picture above tries to bridge together a bunch of untagged
ports (`e0`..`e3`, 1Gbit each) with two tagged ports (`xe0` and `xe1`, 10Gbit) into an upstream IXP
port (`xxv0`, 25Gbit). One thing to note for the pedants (and I love me some good pedantry) is that
the total physical bandwidth to downstream members in this gateway (4x1+2x10 == 24Gbit) is lower
than the physical bandwidth to the IXP platform (25Gbit), which makes sense. It means that there
will not be contention per se.
Building this topology in VPP is rather straightforward using a so-called **Bridge Domain**,
which will be referred to by its bridge-id, for which I'll rather arbitrarily choose 8298:
```
vpp# create bridge-domain 8298
vpp# set interface l2 bridge xxv0 8298
vpp# set interface l2 bridge e0 8298
vpp# set interface l2 bridge e1 8298
vpp# set interface l2 bridge e2 8298
vpp# set interface l2 bridge e3 8298
vpp# set interface l2 bridge xe0 8298
vpp# set interface l2 bridge xe1 8298
```
### VPP: Bridge Domain Encapsulations
I cheated a little bit in the previous section: I added the two TenGig ports called `xe0` and `xe1`
directly to the bridge; however they are trunk ports to breakout switches which will each contain
three additional downstream customers. So to add these six new customers, I will do the following:
```
vpp# set interface l3 xe0
vpp# create sub-interfaces xe0 10
vpp# create sub-interfaces xe0 20
vpp# create sub-interfaces xe0 30
vpp# set interface l2 bridge xe0.10 8298
vpp# set interface l2 bridge xe0.20 8298
vpp# set interface l2 bridge xe0.30 8298
```
The first command here puts the interface `xe0` back into Layer3 mode, which will detach it from the
bridge-domain. The second set of commands creates sub-interfaces with dot1q tags 10, 20 and 30
respectively. The third set then adds these three sub-interfaces to the bridge. By the way, I'll do
this for the `xe0` port shown above, and also for the second `xe1` port, so all-up that makes six
downstream member ports.
Readers of my articles at this point may have a little bit of an uneasy feeling: "What about the
VLAN Gymnastics?" I hear you ask :) You see, VPP will generally just pick up these ethernet frames
from `xe0.10` which are tagged, and add them as-is to the bridge, which is weird, because all the
other bridge ports are expecting untagged frames. So what I must do is tell VPP, upon receipt of a
tagged ethernet frame on these ports, to strip the tag; and on the way out, before transmitting the
ethernet frame, to wrap it into its correct encapsulation. This is called **tag rewriting** in VPP,
and I've written a bit about it in [[this article]({% post_url 2022-02-14-vpp-vlan-gym %})] in case
you're curious. But to cut to the chase:
```
vpp# set interface l2 tag-rewrite xe0.10 pop 1
vpp# set interface l2 tag-rewrite xe0.20 pop 1
vpp# set interface l2 tag-rewrite xe0.30 pop 1
vpp# set interface l2 tag-rewrite xe1.10 pop 1
vpp# set interface l2 tag-rewrite xe1.20 pop 1
vpp# set interface l2 tag-rewrite xe1.30 pop 1
```
Alright, with the VLAN gymnastics properly applied, I now have a bridge with all ten downstream
members and one upstream port (`xxv0`):
```
vpp# show bridge-domain 8298 int
BD-ID Index BSN Age(min) Learning U-Forwrd UU-Flood Flooding ARP-Term arp-ufwd Learn-co Learn-li BVI-Intf
8298 1 0 off on on flood on off off 1 16777216 N/A
Interface If-idx ISN SHG BVI TxFlood VLAN-Tag-Rewrite
xxv0 3 1 0 - * none
e0 5 1 0 - * none
e1 6 1 0 - * none
e2 7 1 0 - * none
e3 8 1 0 - * none
xe0.10 19 1 0 - * pop-1
xe0.20 20 1 0 - * pop-1
xe0.30 21 1 0 - * pop-1
xe1.10 22 1 0 - * pop-1
xe1.20 23 1 0 - * pop-1
xe1.30 24 1 0 - * pop-1
```
One cool thing to re-iterate is that VPP is really a router, not a switch. It's
entirely possible and common to create two completely independent sub-interfaces with .1q tag 10 (in
my case, `xe0.10` and `xe1.10`) and use the bridge-domain to tie them together.
#### Validating Bridge Domains
Looking at my clients above, I can see that several of them are untagged (`e0`-`e3`) and a few of
them are tagged behind ports `xe0` and `xe1`. It should be straightforward to validate reachability
with the following simple ping command:
```
pim@clientA:~$ fping -a -g 192.0.2.0/24
192.0.2.1 is alive
192.0.2.2 is alive
192.0.2.3 is alive
192.0.2.4 is alive
192.0.2.5 is alive
192.0.2.6 is alive
192.0.2.7 is alive
192.0.2.8 is alive
192.0.2.9 is alive
192.0.2.10 is alive
192.0.2.254 is alive
```
At this point the table stakes configuration provides for a Layer2 bridge domain spanning all of
these ports, including performing the correct encapsulation on the TenGig ports that connect to
the switches. There is L2 reachability between all clients over this VPP IXP Gateway.
**✅ Requirement #1 is implemented!**
### VPP: MAC Address Filtering
Enter classifiers! Actually while doing the research for this article, I accidentally nerd-sniped
myself while going through the features provided by VPP's classifier system, and holy moly is that
thing powerful!
I'm only going to show the results of that little journey through the code base and documentation,
but in an upcoming article I intend to do a thorough deep-dive into VPP classifiers, and add them to
`vppcfg` because I think that would be the bee's knees!
Back to the topic of MAC address filtering, a classifier would look roughly like this:
```
vpp# classify table acl-miss-next deny mask l2 src table 5
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:ca:fe
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:d0:d0
vpp# set interface input acl intfc e0 l2-table 5
vpp# show inacl type l2
Intfc idx Classify table Interface name
5 5 e0
```
The first line creates a classify table where we'll want to match on Layer2 source addresses, and if
there is no entry in the table that matches, the default will be to _deny_ (drop) the ethernet
frame. The next two lines add an entry for ethernet frames which have Layer2 source of the _cafe_
and _d0d0_ MAC addresses. When matching, the action is to _permit_ (accept) the ethernet frame.
Then, I apply this classifier as an l2 input ACL on interface `e0`.
Incidentally, the input ACL can operate at five distinct points in a packet's journey through
the dataplane: at the Layer2 input stage (like I'm using here), in the IPv4 and IPv6 input paths, and
when punting IPv4 and IPv6 traffic, respectively.
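If I read the CLI help correctly, the attachment point follows from which table argument is given to
`set interface input acl` (the table numbers here are hypothetical), roughly like this:

```
vpp# set interface input acl intfc e0 l2-table 5
vpp# set interface input acl intfc e0 ip4-table 6
vpp# set interface input acl intfc e0 ip6-table 7
vpp# set interface input acl intfc e0 ip4-punt-table 8
vpp# set interface input acl intfc e0 ip6-punt-table 9
```

Each of these attaches a (previously created) classify table at one of the five points mentioned
above.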
#### Validating MAC filtering
Remember when I created the classify table and added two bogus MAC addresses to it? Let me show you
what would happen on client A, which is directly connected to port `e0`.
```
pim@clientA:~$ ip -br link show eno3
eno3 UP 3c:ec:ef:6a:7b:74 <BROADCAST,MULTICAST,UP,LOWER_UP>
pim@clientA:~$ ping 192.0.2.254
PING 192.0.2.254 (192.0.2.254) 56(84) bytes of data.
...
```
This is expected because ClientA's MAC address has not yet been added to the classify table driving
the Layer2 input ACL, which is quickly remedied like so:
```
vpp# classify session acl-hit-next permit table-index 5 match l2 src 3c:ec:ef:6a:7b:74
...
64 bytes from 192.0.2.254: icmp_seq=34 ttl=64 time=2048 ms
64 bytes from 192.0.2.254: icmp_seq=35 ttl=64 time=1024 ms
64 bytes from 192.0.2.254: icmp_seq=36 ttl=64 time=0.450 ms
64 bytes from 192.0.2.254: icmp_seq=37 ttl=64 time=0.262 ms
```
**✅ Requirement #2 is implemented!**
### VPP: Traffic Policers
I realize that from the IXP's point of view, not all the available bandwidth behind `xxv0` should be
made available to all clients. Some may have negotiated a higher- or lower- bandwidth available to
them. Therefore, the VPP IXP Gateway should be able to rate limit the traffic flowing through it, for
which a VPP feature already exists: Policers.
Consider for a moment our client A (untagged on port `e0`), and client E (behind port `xe0` with a
dot1q tag of 10). Client A has a bandwidth of 1Gbit, but client E nominally has a bandwidth of
10Gbit. If I were to want to restrict both clients to, say, 150Mbit, I could do the following:
```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a e0
vpp# policer output name client-a e0
vpp# policer add name client-e rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-e xe0.10
vpp# policer output name client-e xe0.10
```
And here's where I bump into a stubborn VPP dataplane. I would've expected the input and output
packet shaping to occur on both the untagged interface `e0` as well as the tagged interface
`xe0.10`, but alas, the policer only works in one of these four cases. Ouch!
I read the code around `vnet/src/policer/` and understand the following:
* On input, the policer is applied on `device-input` which is the Phy, not the Sub-Interface. This
explains why the policer works on untagged, but not on tagged interfaces.
* On output, the policer is applied on `ip4-output` and `ip6-output`, which works only for L3
enabled interfaces, not for L2 ones like the ones in this bridge domain.
I also tried to work with classifiers, like in the MAC address filtering above -- but I concluded
here as well, that the policer works only on input, not on output. So the mission is now to figure
out how to enable an L2 policer on (1) untagged output, and (2) tagged in- and output.
**❌ Requirement #3 is not implemented!**
## What's Next
It's too bad that policers are a bit fickle, but I think that's fixable. I've
started a thread on `vpp-dev@` to discuss, and will reach out to Stanislav, who added the
_policer output_ capability in commit `e5a3ae0179`.
Of course, this is just a proof of concept. I typed most of the configuration by hand on the VPP IXP
Gateway, just to show a few of the more advanced features of VPP. For me, this triggered a whole new
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)
---
date: "2023-11-11T13:31:00Z"
title: Debian on Mellanox SN2700 (32x100G)
---
# Introduction
I'm still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic,
and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these
network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it's a
bit more convenient to connect them all to a switch. However, for this to work I would need (at
least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real
quick.
Or do they?
## Hardware
{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-1.png" alt="SN2700" >}}
I thought I'd ask the #nlnog IRC channel for advice, and of course the usual suspects came past,
such as Juniper, Arista, and Cisco. But somebody mentioned "How about Mellanox, like SN2700?" and I
remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and
I found one for EUR 1'400,- for 32x100G which felt suspiciously low priced... but I thought YOLO and
I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.
The switch specs are pretty impressive, with 32x100G QSFP28 ports, which can be broken out to a set
of sub-ports (each of 1/10/25/50G), with a specified switch throughput of 6.4Tbps and 4.76Gpps, while
only consuming ~150W all-up.
Further digging revealed that the architecture of this switch consists of two main parts:
{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-2.png" alt="SN2700" >}}
1. an AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB
and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the
front of the chassis. Furthermore, it's equipped with 8GB of RAM in an SO-DIMM, and its CPU is a
two-core Celeron(R) CPU 1047UE @ 1.40GHz.
1. the silicon used in this switch is called _Spectrum_ and identifies itself in Linux as PCI
device `03:00.0` called _Mellanox Technologies MT52100_, so the front dataplane with 32x100G is
separated from the Linux based controlplane.
{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-3.png" alt="SN2700" >}}
When turning on the device, the serial port comes to life and shows me a BIOS, quickly after which
it jumps into GRUB2 and wants me to install it using something called ONIE. I've heard of that, but
now it's time for me to learn a little bit more about that stuff. I ask around and there are plenty of
ONIE images for this particular type of chip to be found - some are open source, some are semi-open
source (as in: once freely available, but now behind paywalls etc).
Before messing around with the switch and possibly locking myself out or bricking it, I take out the
16GB mSATA and make a copy of it for safe keeping. I feel somewhat invincible by doing this. How bad
could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I'm about
to find out, so read on!
## Software
The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that
supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and
the capabilities of the Mellanox Spectrum® ASIC. The SN2700 has three modes of operation:
* Preinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system
utilizing common networking user experiences and industry standard CLI.
* Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience
from servers to switches and providing a rich routing functionality for large scale applications.
* Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based
operating systems.
I asked around a bit more and found that there are a few more things one might do with this switch.
One of them is [[SONiC](https://github.com/sonic-net/SONiC)], which stands for _Software for Open
Networking in the Cloud_, and has support for the _Spectrum_ and notably the _SN2700_ switch. Cool!
I also learned about [[DENT](https://dent.dev)], which utilizes the Linux Kernel, Switchdev,
and other Linux based projects as the basis for building a new standardized network operating system
without abstractions or overhead. Unfortunately, while the _Spectrum_ chipset is known to DENT, this
particular layout on the SN2700 is not supported.
Finally, my buddy `fall0ut` said "why not just Debian with switchdev?" and now my eyes opened wide.
I had not yet come across [[switchdev](https://docs.kernel.org/networking/switchdev.html)], which is
a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane
from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation
in the [[linux kernel](https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/mellanox/mlxsw)]
for the Spectrum series of silicon, and it's all upstreamed to the Linux kernel. Wait, what?!
### Mellanox Switchdev
I start by reading the [[brochure](/assets/mlxsw/PB_Spectrum_Linux_Switch.pdf)], which shows me the
intentions Mellanox had when designing and marketing these switches. It seems that they really meant
it when they said this thing is a fully customizable Linux switch, check out this paragraph:
> Once the Mellanox Switchdev driver is loaded into the Linux Kernel, each
> of the switch's physical ports is registered as a net_device within the
> kernel. Using standard Linux tools (for example, bridge, tc, iproute), ports
> can be bridged, bonded, tunneled, divided into VLANs, configured for L3
> routing and more. Linux switching and routing tables are reflected in the
> switch hardware. Network traffic is then handled directly by the switch.
> Standard Linux networking applications can be natively deployed and
> run on switchdev. This may include open source routing protocol stacks,
> such as Quagga, Bird and XORP, OpenFlow applications, or user-specific
> implementations.
### Installing Debian on SN2700
.. they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I
had laying around. After installing it, I noticed that the coveted `mlxsw` driver is not shipped by
default on the Linux kernel image in Debian, so I decide to build my own, letting the [[Debian
docs](https://wiki.debian.org/BuildADebianKernelPackage)] take my hand and guide me through it.
I find a reference on the Mellanox [[GitHub
wiki](https://github.com/Mellanox/mlxsw/wiki/Installing-a-New-Kernel)] which shows me which kernel
modules to include to successfully use the _Spectrum_ under Linux, so I think I know what to do:
```
pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \
libncurses5-dev libelf-dev libssl-dev dwarves bison
pim@summer:/usr/src$ sudo apt install linux-source-6.1
pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz
pim@summer:/usr/src$ cd linux-source-6.1/
pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config
pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_IPV6_GRE=m
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_DEVLINK=y
CONFIG_MLXFW=m
CONFIG_MLXSW_CORE=m
CONFIG_MLXSW_CORE_HWMON=y
CONFIG_MLXSW_CORE_THERMAL=y
CONFIG_MLXSW_PCI=m
CONFIG_MLXSW_I2C=m
CONFIG_MLXSW_MINIMAL=y
CONFIG_MLXSW_SWITCHX2=m
CONFIG_MLXSW_SPECTRUM=m
CONFIG_MLXSW_SPECTRUM_DCB=y
CONFIG_LEDS_MLXCPLD=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_CLS_FLOWER=m
CONFIG_NET_ACT_GACT=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_SAMPLE=m
CONFIG_NET_ACT_VLAN=m
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_NET_VRF=m
EOF
pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig
pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg
```
I run a gratuitous `make menuconfig` after adding all those config statements to the end of the
`.config` file, and it figures out how to combine what I appended with what was in the file already.
Since I started from the standard Bookworm 6.1 kernel config that came with the default installer,
the result is a minimal diff against what Debian itself ships.
After Summer stretches her legs a bit compiling this kernel for me, look at the result:
```
pim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw
drwxr-xr-x root/root 0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/
-rw-r--r-- root/root 414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko
-rw-r--r-- root/root 19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko
-rw-r--r-- root/root 31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko
-rw-r--r-- root/root 65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko
-rw-r--r-- root/root 1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko
```
Good job, Summer! On my mSATA disk, I tell Linux to boot its kernel using the following in GRUB,
which will make the kernel not create spiffy interface names like `enp6s0` or `eno1` but just
enumerate them all one by one and call them `eth0` and so on:
```
pim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0"
```
## Mellanox SN2700 running Debian+Switchdev
{{< image width="400px" float="right" src="/assets/mlxsw/debian.png" alt="Debian" >}}
I insert the freshly installed Debian Bookworm with custom compiled 6.1.55+mlxsw kernel into the
switch, and it boots on the first try. I see 34 (!) ethernet ports: the first two come from an Intel
NIC but carry a MAC address from Mellanox (starting with `0c:42:a1`), and the other 32 share a
common Mellanox prefix (starting with `04:3f:72`). I also notice that the MAC addresses skip one
between subsequent ports, which leads me to believe that these 100G ports can be split in two
(perhaps 2x50G, 2x40G, 2x25G or 2x10G - I intend to find out later). According to the official spec
sheet, the switch allows 2-way breakout ports as well as converter modules, to insert for example a
25G SFP28 into a QSFP28 switchport.
Honestly, I did not think I would get this far, so I humorously (at least, I think so) decide to
call this switch [[FAFO](https://www.urbandictionary.com/define.php?term=FAFO)].
First off, the `mlxsw` driver loaded:
```
root@fafo:~# lsmod | grep mlx
mlxsw_spectrum 708608 0
mlxsw_pci 36864 1 mlxsw_spectrum
mlxsw_core 217088 2 mlxsw_pci,mlxsw_spectrum
mlxfw 36864 1 mlxsw_core
vxlan 106496 1 mlxsw_spectrum
ip6_tunnel 45056 1 mlxsw_spectrum
objagg 53248 1 mlxsw_spectrum
psample 20480 1 mlxsw_spectrum
parman 16384 1 mlxsw_spectrum
bridge 311296 1 mlxsw_spectrum
```
I run `sensors-detect` and `pwmconfig`, let the fans calibrate and write their config file. The fans
come back down to a more chill (pun intended) speed, and I take a closer look. It seems all fans and
all thermometers, including the ones in the QSFP28 cages and the _Spectrum_ switch ASIC are
accounted for:
```
root@fafo:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +30.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +29.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +30.0°C (high = +87.0°C, crit = +105.0°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +106.0°C)
temp2: +29.8°C (crit = +106.0°C)
mlxsw-pci-0300
Adapter: PCI adapter
fan1: 6239 RPM
fan2: 5378 RPM
fan3: 6268 RPM
fan4: 5378 RPM
fan5: 6326 RPM
fan6: 5442 RPM
fan7: 6268 RPM
fan8: 5315 RPM
temp1: +37.0°C (highest = +41.0°C)
front panel 001: +23.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 002: +24.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 003: +23.0°C (crit = +73.0°C, emerg = +75.0°C)
front panel 004: +26.0°C (crit = +73.0°C, emerg = +75.0°C)
...
```
From the top, first I see the classic CPU core temps, then an `ACPI` interface which I'm not quite
sure I understand the purpose of (possibly motherboard, but not PSU because pulling one out does not
change any values). Finally, the sensors using driver `mlxsw-pci-0300` are those on the switch PCB
carrying the _Spectrum_ silicon, and there's a thermometer for each of the QSFP28 cages, possibly
reading from the optic, as most of them are empty except the first four, into which I inserted optics.
Slick!
{{< image width="500px" float="right" src="/assets/mlxsw/debian-ethernet.png" alt="Ethernet" >}}
I notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the
Debian machine. But then, the rest of the ports are the Mellanox Spectrum ASIC:
* eth2-17 correspond to port 17-32, which seems normal, but
* eth18-19 correspond to port 15-16
* eth20-21 correspond to port 13-14
* eth30-31 correspond to port 3-4
* eth32-33 correspond to port 1-2
The switchports are actually sequentially numbered with respect to MAC addresses, with `eth2`
starting at `04:3f:72:74:a9:41` and finally `eth33` having `04:3f:72:74:a9:7f` (for 64 consecutive
MACs).
Somehow though, the ports are wired in a different way on the front panel. As it turns out, I can
insert a little `udev` ruleset that will take care of this:
```
root@fafo:~# cat << EOF > /etc/udev/rules.d/10-local.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
NAME="sw$attr{phys_port_name}"
EOF
```
After rebooting the switch, the ports are now called `swp1` .. `swp32` and they also correspond with
their physical ports on the front panel. One way to check this is using `ethtool --identify swp1`,
which will blink the LED of port 1 until I press ^C. Nice.
### Debian SN2700: Diagnostics
The first thing I'm curious to try, is if _Link Layer Discovery Protocol_
[[LLDP](https://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol)] works. This is a
vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet.
I install an open source LLDP daemon and plug in a DAC from port 1 to a Centec switch in the lab.
And indeed, quickly after that, I see two neighbors: the first on the Linux machine's `eth0`, which is
the Unifi switch that serves my LAN, and the second is the Centec behind `swp1`:
```
root@fafo:~# apt-get install lldpd
root@fafo:~# lldpcli show nei summary
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eth0, via: LLDP
Chassis:
ChassisID: mac 44:d9:e7:05:ff:46
SysName: usw6-BasementServerroom
Port:
PortID: local Port 9
PortDescr: fafo.lab
TTL: 120
Interface: swp1, via: LLDP
Chassis:
ChassisID: mac 60:76:23:00:01:ea
SysName: sw3.lab
Port:
PortID: ifname eth-0-25
PortDescr: eth-0-25
TTL: 120
```
With this I learn that the switch forwards these datagrams (ethernet type `0x88CC`) from the
dataplane to the Linux controlplane. I would call this _punting_ in VPP language, but switchdev
calls it _trapping_, and I can see the LLDP packets when tcpdumping on ethernet device `swp1`.
So today I learned how to trap packets :-)
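To see those trapped frames for yourself, a tcpdump invocation along these lines (filtering on the
LLDP ethertype `0x88cc`) should do the trick - a sketch, since the exact interface obviously depends
on where your neighbor is connected:

```
root@fafo:~# tcpdump -ni swp1 ether proto 0x88cc
```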
### Debian SN2700: ethtool
One popular diagnostics tool that is useful (and hopefully well known, because it's awesome) is
`ethtool`, a command-line tool in Linux for managing network interface devices. It allows me to
modify the parameters of the ports and their transceivers, as well as query the information of those
devices.
Here are few common examples, all of which work on this switch running Debian:
* `ethtool swp1`: Shows link capabilities (eg, 1G/10G/25G/40G/100G)
* `ethtool -s swp1 speed 40000 duplex full autoneg off`: Force speed/duplex
* `ethtool -m swp1`: Shows transceiver diagnostics like SFP+ light levels, link levels
(also `--module-info`)
* `ethtool -p swp1`: Flashes the transceiver port LED (also `--identify`)
* `ethtool -S swp1`: Shows packet and octet counters, and sizes, discards, errors, and so on
(also `--statistics`)
I specifically love the _digital diagnostics monitoring_ (DDM), originally specified in
[[SFF-8472](https://members.snia.org/document/dl/25916)], which allows me to read the EEPROM of
optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!
### Debian SN2700: devlink
In reading up on the _switchdev_ ecosystem, I stumbled across `devlink`, an API to expose device
information and resources not directly related to any device class, such as switch ASIC
configuration. As a fun fact, devlink was written by the same engineer who wrote the `mlxsw` driver
for Linux, Jiří Pírko. Its documentation can be found in the [[linux
kernel](https://docs.kernel.org/networking/devlink/index.html)], and it ships with any modern
`iproute2` distribution. The specific (somewhat terse) documentation of the `mlxsw` driver
[[lives there](https://docs.kernel.org/networking/devlink/mlxsw.html)] as well.
There's a lot to explore here, but I'll focus my attention to three things:
#### 1. devlink resource
When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many
prefixes can be offloaded to the ASIC? One way to find out is to query what types of _resources_ it
has:
```
root@fafo:~# devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
name kvd size 258048 unit entry dpipe_tables none
resources:
name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none
resources:
name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none
name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none
name span_agents size 3 occ 0 unit entry dpipe_tables none
name counters size 32000 occ 4 unit entry dpipe_tables none
resources:
name rif size 8192 occ 0 unit entry dpipe_tables none
name flow size 23808 occ 4 unit entry dpipe_tables none
name global_policers size 1000 unit entry dpipe_tables none
resources:
name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
name rifs size 1000 occ 1 unit entry dpipe_tables none
name physical_ports size 64 occ 36 unit entry dpipe_tables none
```
There's a lot to unpack here, but this is a tree of resources, each with names and children. Let me
focus on the first one, called `kvd`, which stands for _Key Value Database_ (in other words, a set
of lookup tables). It contains a bunch of children called `linear`, `hash_double` and `hash_single`.
The kernel [[docs](https://www.kernel.org/doc/Documentation/ABI/testing/devlink-resource-mlxsw)]
explain it in more detail, but this is where the switch will keep its FIB in _Content Addressable
Memory_ (CAM) of certain types of elements of a given length and count. All up, the size is 252K
entries, which is not huge, but also certainly not tiny!
Here I learn that it's subdivided into:
* ***linear***: 96K entries of flat memory using an index, further divided into regions:
  * ***singles***: 16K entries of size 1, for nexthops
  * ***chunks***: 48K entries of size 32, for multipath routes with <32 entries
  * ***large_chunks***: 32K entries of size 512, for multipath routes with <512 entries
* ***hash_single***: 92K entries of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and
neighbors)
* ***hash_double***: 63K entries of hash table for keys larger than 64 bits (eg. IPv6 FIB and
neighbors)
#### 2. devlink dpipe
Now that I know the memory layout and regions of the CAM, I can start making some guesses on the FIB
size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the
ASIC's pipeline in a generic way. The API is described in detail in the [[kernel
docs](https://docs.kernel.org/networking/devlink/devlink-dpipe.html)]. Let me take a peek at the
dataplane's configuration innards:
```
root@fafo:~# devlink dpipe table show pci/0000:03:00.0
pci/0000:03:00.0:
name mlxsw_erif size 1000 counters_enabled false
match:
type field_exact header mlxsw_meta field erif_port mapping ifindex
action:
type field_modify header mlxsw_meta field l3_forward
type field_modify header mlxsw_meta field l3_drop
name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1
match:
type field_exact header mlxsw_meta field erif_port mapping ifindex
type field_exact header ipv4 field destination ip
action:
type field_modify header ethernet field destination mac
name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2
match:
type field_exact header mlxsw_meta field erif_port mapping ifindex
type field_exact header ipv6 field destination ip
action:
type field_modify header ethernet field destination mac
name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1
match:
type field_exact header mlxsw_meta field adj_index
type field_exact header mlxsw_meta field adj_size
type field_exact header mlxsw_meta field adj_hash_index
action:
type field_modify header ethernet field destination mac
type field_modify header mlxsw_meta field erif_port mapping ifindex
```
From this I can puzzle together how the CAM is _actually_ used:
* ***mlxsw_host4***: matches on the interface port and IPv4 destination IP, using `hash_single`
above with one unit for each entry, and when looking that up, puts the result into the ethernet
destination MAC (in other words, the FIB entry points at an L2 nexthop!)
* ***mlxsw_host6***: matches on the interface port and IPv6 destination IP using `hash_double`
with two units for each entry.
* ***mlxsw_adj***: holds the L2 adjacencies, and the lookup key is an index, size and hash index,
where the returned value is used to rewrite the destination MAC and select the egress port!
Now that I know the types of tables and what they are matching on (and then which action they are
performing), I can also take a look at the _actual data_ in the FIB. For example, if I create an IPv4
interface on the switch and ping a member of the directly connected network, I can see an entry
show up in the L2 adjacency table, like so:
```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ping 100.65.1.2
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4
pci/0000:03:00.0:
index 0
match_value:
type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1
type field_exact header ipv4 field destination ip value 100.65.1.2
action_value:
type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10
```
To decipher what the switch is doing: if the ifindex is `71` (which corresponds to `swp31`), and the
IPv4 destination IP address is `100.65.1.2`, then the destination MAC address will be set to
`b4:96:91:b3:b1:10`, so the switch knows where to send this ethernet datagram.
And now I have found what I need to know to be able to answer the question of the FIB size. This
switch can take **92K IPv4 routes** and **31.5K IPv6 routes**, and I can even inspect the FIB in great
detail. Rock on!
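The arithmetic behind that claim, give or take rounding, follows from the `devlink resource` sizes
and the `resource_units` fields in the dpipe tables above: one `hash_single` unit per IPv4 entry,
two `hash_double` units per IPv6 entry:

```python
# FIB capacity estimate from the devlink output above: hash_single holds
# IPv4 entries (1 unit each), hash_double holds IPv6 entries (2 units each).
hash_single_entries = 94336  # from `devlink resource show`
hash_double_entries = 65408

ipv4_fib = hash_single_entries       # 1 unit per IPv4 entry
ipv6_fib = hash_double_entries // 2  # 2 units per IPv6 entry

print(f"IPv4 FIB: ~{ipv4_fib / 1024:.0f}K, IPv6 FIB: ~{ipv6_fib / 1024:.0f}K")
```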
#### 3. devlink port split
But reading the switch chip configuration and FIB is not all that `devlink` can do, it can also make
changes! One particularly interesting one is the ability to split and unsplit ports. What this means
is that, when you take a 100Gbit port, it is internally divided into four so-called _lanes_ of
25Gbit each, while a 40Gbit port is internally divided into four _lanes_ of 10Gbit each. Splitting
ports is the act of taking such a port and reconfiguring its lanes.
Let me show you, by means of example, what splitting the first two switchports might look like. They
begin their life as 100G ports, which support a number of link speeds, notably: 100G, 50G, 25G, but
also 40G, 10G, and finally 1G:
```
root@fafo:~# ethtool swp1
Settings for swp1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
40000baseCR4/Full
40000baseSR4/Full
40000baseLR4/Full
25000baseCR/Full
25000baseSR/Full
50000baseCR2/Full
100000baseSR4/Full
100000baseCR4/Full
100000baseLR4_ER4/Full
root@fafo:~# devlink port show | grep 'swp[12] '
pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4
pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4
root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4
[ 629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down
[ 629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down
[ 630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726)
[ 630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2
[ 630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2
[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2
root@fafo:~# ethtool swp1s0
Settings for swp1s0:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseSR/Full
```
Whoa, what just happened here? The switch took the port defined by `pci/0000:03:00.0/61` which says
it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
and the resulting ports are 25G, 10G or 1G.
{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
However, I make an important observation. When splitting `swp1` in four, the switch also removed port
`swp2`. Remember, at the beginning of this article I mentioned that the MAC addresses seemed to
skip one entry between subsequent interfaces? Now I understand why: when splitting a port in two,
it will use the second MAC address for the second 50G port; but if I split it in four, it'll use
the MAC addresses from the adjacent port and decommission that port. In other words: this switch can do
32x100G, or 64x50G, or 64x25G/10G/1G.
It doesn't matter which of the PCI interfaces I split on. The operation is also reversible, I can
issue `devlink port unsplit` to return the port to its aggregate state (eg. 4 lanes and 100Gbit),
which will remove the `swp1s0-3` ports and put back `swp1` and `swp2` again.
What I find particularly impressive about this, is that for most hardware vendors, this splitting of
ports requires a reboot of the chassis, while here it can happen entirely online. Well done,
Mellanox!
## Performance
OK, so this all seems to work, but does it work _well_? If you're a reader of my blog you'll know
that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the
switch on ports 31 and 32:
```
[ 1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[ 1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
[ 7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
```
I hope you're hungry, Hippo, cuz you're about to get fed!
### Debian SN2700: L2
To use the switch in L2 mode, I intuitively create a Linux bridge, say `br0`, and add ports to it.
From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one
another, but there can only be _one_ such bridge with `vlan_filtering` set. VLAN Filtering allows
the switch to only accept tagged frames from a list of _configured_ VLANs, and drop the rest. This
is what you'd imagine a regular commercial switch would provide.
So off I go, creating the bridge in which I'll add two ports (HundredGigabitEthernet port swp31 and
swp32), and I will allow for the maximum MTU size of 9216, also known as [[Jumbo Frames](https://en.wikipedia.org/wiki/Jumbo_frame)].
```
root@fafo:~# ip link add name br0 type bridge
root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up
root@fafo:~# ip link set swp31 mtu 9216 master br0 up
root@fafo:~# ip link set swp32 mtu 9216 master br0 up
```
These two ports are now _access_ ports, that is to say they accept and emit only untagged traffic,
and due to the `vlan_filtering` flag, they will drop all other frames. Using the standard `bridge`
utility from Linux, I can manipulate the VLANs on these ports.
First, I'll remove the _default VLAN_ and add VLAN 1234 to both ports, specifying that VLAN 1234 is
the so-called _Port VLAN ID_ (pvid). This makes them the equivalent of Cisco's _switchport access
1234_:
```
root@fafo:~# bridge vlan del vid 1 dev swp1
root@fafo:~# bridge vlan del vid 1 dev swp2
root@fafo:~# bridge vlan add vid 1234 dev swp1 pvid
root@fafo:~# bridge vlan add vid 1234 dev swp2 pvid
```
Then, I'll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a
_trunk_ port allowing these tagged VLANs and assuming untagged traffic is still VLAN 1234:
```
root@fafo:~# for port in swp1 swp2; do for vlan in 100 200 300 400; do \
bridge vlan add vid $vlan dev $port; done; done
root@fafo:~# bridge vlan
port vlan-id
swp1 100 200 300 400
1234 PVID
swp2 100 200 300 400
1234 PVID
br0 1 PVID Egress Untagged
```
When these commands are run against the interfaces `swp*`, they are picked up by the `mlxsw` kernel
driver, and transmitted to the _Spectrum_ switch chip; in other words, these commands end up
programming the silicon. Traffic through these front-panel switch ports rarely (if ever) gets
forwarded to the Linux kernel: very similar to [[VPP](https://fd.io/)], the traffic stays mostly in
the dataplane. Some traffic, such as LLDP (and as we'll see later, IPv4 ARP and IPv6 neighbor
discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the
results are transmitted back via PCIe to program the switch chip L2/L3 _Forwarding Information Base_
(FIB).
Now I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a
bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine:
{{< image src="/assets/mlxsw/trex1.png" alt="Trex L2" >}}
At this point I can already conclude that this is all happening in the dataplane, as the _Spectrum_
switch is connected to the Debian machine using a PCIe v3.0 x8 link, which is further constrained by
an upstream device on the PCIe bus, so the Debian kernel is in no way able to process more than a token
amount of traffic; and yet I'm seeing 100Gbit go through the switch chip while the CPU load on the
kernel stays at pretty much zero. I can however retrieve the link statistics using `ip stats`, and those will
show me the actual counters of the silicon, not _just_ the trapped packets. If you'll recall, in VPP
the only packets that the TAP interfaces see are those packets that are _punted_, and the Linux
kernel there is completely oblivious to the total dataplane throughput. Here, the interface is
showing the correct dataplane packet and byte counters, which means that things like SNMP will
automatically just do the right thing.
```
root@fafo:~# dmesg | grep 03:00.*bandwidth
[ 2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s
PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
root@fafo:~# uptime
03:19:16 up 2 days, 14:14, 1 user, load average: 0.00, 0.00, 0.00
root@fafo:~# ip stats show dev swp32 group link
72: swp32: group link
RX: bytes packets errors dropped missed mcast
5106713943502 15175926564 0 0 0 103
TX: bytes packets errors dropped carrier collsns
23464859508367 103495791750 0 0 0 0
```
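Since the kernel interfaces carry the real hardware counters, a simple delta over `/sys/class/net` is enough to estimate the dataplane packet rate. A minimal sketch; on the switch you'd pass a port like `swp32`, and `lo` is only a default so the script runs on any Linux box:

```shell
# Minimal sketch: estimate packets/second from the hardware-backed counters.
# On the SN2700 you would pass a switch port (e.g. swp32); 'lo' is just a
# safe default so this runs anywhere.
IFACE=${1:-lo}
RX1=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
sleep 1
RX2=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
echo "$IFACE: $((RX2 - RX1)) pps"
```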
### Debian SN2700: IPv4 and IPv6
I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge
`br0`, which releases the enslaved switch ports. I then convert the T-Rex loadtester to use an L3
profile, and configure the switch as follows:
```
root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31
root@fafo:~# ip addr add 100.65.2.1/30 dev swp32
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32
```
Several other routers I've loadtested have the same (cosmetic) issue: T-Rex doesn't reply to
ARP packets after the first few seconds. So I first set the IPv4 address, then add a static L2
adjacency for the T-Rex side (on MAC `b4:96:91:b3:b1:10`), and route 16.0.0.0/8 to port 0 and
48.0.0.0/8 to port 1 of the loadtester.
{{< image src="/assets/mlxsw/trex2.png" alt="Trex L3" >}}
I start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up
just fine. Taking a closer look at the `ip stats` instrumentation, I see that there's the ability to
turn on L3 counters in addition to L2 (ethernet) counters. So I do that on my two router ports while
they are happily forwarding 58.9Mpps, and I can now see the difference between dataplane traffic
(forwarded in hardware) and punted traffic (handled by the CPU):
```
root@fafo:~# ip stats set dev swp31 l3_stats on
root@fafo:~# ip stats set dev swp32 l3_stats on
root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats
72: swp32: group offload subgroup l3_stats on used on
RX: bytes packets errors dropped mcast
270222574848200 1137559577576 0 0 0
TX: bytes packets errors dropped
281073635911430 1196677185749 0 0
root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit
72: swp32: group offload subgroup cpu_hit
RX: bytes packets errors dropped missed mcast
1068742 17810 0 0 0 0
TX: bytes packets errors dropped carrier collsns
468546 2191 0 0 0 0
```
The statistics above clearly demonstrate that the lion's share of the packets have been forwarded by
the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course
any traffic _to_ the IP addresses configured on the router) will go to the kernel.
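To put a number on "lion's share": dividing the `cpu_hit` counter by the total gives the fraction the kernel actually saw. A back-of-the-envelope sketch, using the counter values from the session above:

```shell
# Back-of-the-envelope: what fraction of RX packets hit the CPU path?
# Counter values taken from the 'ip stats' output above.
asic=1137559577576   # l3_stats RX packets (hardware-forwarded)
cpu=17810            # cpu_hit RX packets (punted to the kernel)
awk -v a="$asic" -v c="$cpu" \
    'BEGIN { printf "cpu share: %.6f%% of packets\n", 100 * c / (a + c) }'
# prints: cpu share: 0.000002% of packets
```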
#### Debian SN2700: BVI (or VLAN Interfaces)
I've played around a little bit with L2 (switch) and L3 (router) ports, but there is one middle
ground. I'll keep the T-Rex loadtest running in L3 mode, but now I'll reconfigure the switch to put
the ports back into the bridge, each port in its own VLAN, and have so-called _Bridge Virtual
Interface_, also known as VLAN interfaces -- this is where the switch has a bunch of ports together
in a VLAN, but the switch itself has an IPv4 or IPv6 address in that VLAN as well, which can act as
a router.
I reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move
the IPv4 addresses and routes there -- so here I go, first putting the switch interfaces back into
L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports:
```
root@fafo:~# ip link add name br0 type bridge vlan_filtering 1
root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up
root@fafo:~# ip link set swp31 master br0 mtu 9216 up
root@fafo:~# ip link set swp32 master br0 mtu 9216 up
root@fafo:~# bridge vlan del vid 1 dev swp31
root@fafo:~# bridge vlan del vid 1 dev swp32
root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid
root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid
```
From the ASIC specs, I understand that these BVIs need to (re)use a MAC from one of the members, so
the first thing I do is give `br0` the right MAC address. Then I put the switch ports into the bridge,
remove VLAN 1 and put them in their respective VLANs. At this point, the loadtester reports 100%
packet loss, because the two ports can no longer see each other at layer2, and layer3 configs have
been removed. But I can restore connectivity with two _BVIs_ as follows:
```
root@fafo:~# for vlan in 1000 2000; do
ip link add link br0 name br0.$vlan type vlan id $vlan
bridge vlan add dev br0 vid $vlan self
ip link set br0.$vlan up mtu 9216
done
root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000
root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000
```
And with that, the loadtest shoots back in action:
{{< image src="/assets/mlxsw/trex3.png" alt="Trex L3 BVI" >}}
First, a quick overview of the situation I have created:
```
root@fafo:~# bridge vlan
port vlan-id
swp31 1000 PVID
swp32 2000 PVID
br0 1 PVID Egress Untagged
root@fafo:~# ip -4 ro
default via 198.19.5.1 dev eth0 onlink rt_trap
16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload
48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload
100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload
100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload
198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap
root@fafo:~# ip -4 nei
198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE
100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT
100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT
```
Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important
difference. Now, the switch can have any number of ports in VLAN 1000, which will all amongst
themselves do L2 forwarding at line rate, and when they need to send IPv4 traffic out, they will ARP
for the gateway (for example at `100.65.1.1/24`), which will get _trapped_ and forwarded to the CPU,
after which the ARP reply will go out so that the machines know where to find the gateway. From that
point on, IPv4 forwarding happens once again in hardware, which can be shown by the keywords
`rt_offload` in the routing table (`br0`, in the ASIC), compared to the `rt_trap` (`eth0`, in the
kernel). Similarly for the IPv4 neighbors, the L2 adjacency is programmed into the CAM (the output
of which I took a look at above), so forwarding can be done directly by the ASIC without
intervention from the CPU.
As a result, these _VLAN Interfaces_ (which are synonymous with _BVIs_), work at line rate out of
the box.
### Results
This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their `mlxsw`
switchdev implementation. I have in my hands a very affordable 32x100G or 64x(50G, 25G, 10G, 1G) and
anything in between, with IPv4 and IPv6 forwarding in hardware, with a limited FIB size, not too
dissimilar from the [[Centec]({% post_url 2022-12-09-oem-switch-2 %})] switches that IPng Networks
runs in its AS8298 network, albeit without MPLS forwarding capabilities.
Still, as a LAB switch for testing 25G and 100G topologies, this is very good value for the money,
and it runs Debian and is fully configurable with things like Kees and Ansible.
Considering there's a whole range of 48x10G and 48x25G switches as well from Mellanox, all
completely open and _officially allowed_ to run OSS stuff on, these make a perfect fit for IPng
Networks!
### Acknowledgements
This article was written after fussing around and finding out, but a few references were particularly
helpful, and I'd like to acknowledge the following super useful sites:
* [[mlxsw wiki](https://github.com/Mellanox/mlxsw/wiki)] on GitHub
* [[jpirko's kernel driver](https://github.com/jpirko/linux_mlxsw)] on GitHub
* [[SONiC wiki](https://github.com/sonic-net/SONiC/wiki/)] on GitHub
* [[Spectrum Docs](https://www.nvidia.com/en-us/networking/ethernet-switching/spectrum-sn2000/)] on NVIDIA
And to the community for writing and maintaining this excellent switchdev implementation.
---
date: "2023-12-17T13:37:00Z"
title: Debian on IPng's VPP Routers
---
{{< image width="200px" float="right" src="/assets/debian-vpp/debian-logo.png" alt="Debian" >}}
# Introduction
When IPng Networks first built out a european network, I was running the Disaggregated Network
Operating System [[ref](https://www.danosproject.org/)], initially based on AT&Ts “dNOS” software
framework. Over time though, the DANOS project slowed down, and the developers with whom I had a
pretty good relationship all left for greener pastures.
In 2019, Pierre Pfister (and several others) built a VPP _router sandbox_ [[ref](https://wiki.fd.io/view/VPP_Sandbox/router)],
which graduated into a feature called the Linux Control Plane plugin
[[ref](https://s3-docs.fd.io/vpp/22.10/developer/plugins/lcp.html)]. Lots of folks put in an effort
for the Linux Control Plane, notably Neale Ranns from Cisco (these days Graphiant), and Matt Smith
and Jon Loeliger from Netgate (who ship this as TNSR [[ref](https://netgate.com/tnsr)], check it out!).
I helped as well, by adding a bunch of Netlink handling and VPP-to-Linux synchronization code,
which I've written about a bunch on this blog in the 2021 VPP development series [[ref]({% post_url
2021-08-12-vpp-1 %})].
At the time, Ubuntu and CentOS were the supported platforms, so I installed a bunch of Ubuntu
machines when doing the deploy with my buddy Fred from IP-Max [[ref](https://ip-max.net)]. But as
time went by, I fell back to my old habit of running Debian on hypervisors and VMs for the services
at IPng Networks. After some time automating mostly everything with Ansible and Kees, I got tired of
those places where I needed branches like `if Ubuntu then ... elif Debian then ... elif OpenBSD
then ... else panic`.
I took stock of the fleet at the end of 2023, and I found the following:
* ***OpenBSD***: 3 virtual machines, bastion jumphosts connected to Internet and IPng Site Local
* ***Ubuntu***: 4 physical machines, VPP routers (`nlams0`, `defra0`, `chplo0` and `usfmt0`)
* ***Debian***: 22 physical machines and 116 virtual machines, running internal and public services,
almost all of these machines are entirely in IPng Site Local [[ref]({% post_url
2023-03-11-mpls-core %})], not connected to the
internet at all.
It became clear to me that I could make a small sprint to standardize all physical hardware on
Debian Bookworm, and move away from Ubuntu LTS. In case you're wondering: there's **nothing wrong with
Ubuntu**, although I will admit I'm not a big fan of `snapd` and `cloud-init` but they are easily
disabled. With the way the situation evolved in AS8298, I ended up running a fair few more Debian
physical (and virtual) machines, so I'll make an executive decision to move to Debian. By the way,
the fun thing about IPng is that being the _Chief of Everything_ (COE), I get to make those calls
unilaterally :)
## Upgrading to Debian
Luckily, I already have a fair number of VPP routers that have been deployed on Debian (mostly
_Bullseye_, but one of them is _Bookworm_), and my LAB environment [[ref]({% post_url
2022-10-14-lab-1 %})] is running Debian Bookworm as well. Although its native habitat is Ubuntu, I
regularly run VPP in a Debian environment, for example when Adrian contributed the MPLS code
[[ref]({% post_url 2023-05-21-vpp-mpls-3 %})], he also recommended Debian 12, because that ships
with a modern libnl which supports a few bits and pieces he needed.
### Preparations
{{< image width="300px" float="right" src="/assets/debian-vpp/defra0.png" alt="Frankfurt" >}}
OK, while my network is not large, it's also not completely devoid of customers, so instead of a
YOLO, I decide to make an action plan that roughly looks like this:
1. Notify customers of upcoming maintenance
1. For each of the routers to-be-upgraded:
1. Check the borgmatic daily backups
1. Drain traffic away from the router
1. Use IPMI to re-install it remotely
1. Put the VPP, Bird, SSH configs back
1. Undrain the router
1. Drink my advents-calendar tea!
When deploying a datacenter site, I am adamant about having a consistent and dependable environment. At
each site, specifically those that are a bit further away, I deploy a standard issue PCEngines
APU [[ref](https://pcengines.ch/)] with 802.11ac WiFi, serial, and IPMI access to any machine that may be
there. If you ever visit a datacenter floor where I'm present, look for an SSID like `AS8298 FRA` in the
case of the Frankfurt site. The password is `IPngGuest`, you're welcome to some bits of bandwidth in a
pinch :)
You can find the APU in the picture to the right. All the way at the top, you'll see a
small blue machine with two antennas sticking out. It's connected to my carrier, AS25091's packet
factory Cisco ASR9010, for out of band connectivity. Then, all the way at the bottom, you can see my
Supermicro SYS-5018D-FN8T called `defra0.ipng.ch` paired with a Centec MPLS switch for transport and
breakout ports 😍.
When I installed all of this kit, I did two specific things that will greatly benefit me now:
1. I enabled IPMI KVM and Serial-over-LAN on the Supermicro, so I can reach it over its dedicated
IPMI port, and see what its VGA does. Also, in case anything weird happens to VPP and/or the
Centec switches and IPng Site Local becomes unavailable, I can still log in and take a look via serial.
2. I installed Samba on the APU, which allows me to instruct the IPMI to insert a virtual USB 'stick' by
means of mounting a SAMBA share. This is incredibly useful in scenarios such as this reinstall!
Although I do trust it, I would hate to reboot the machine only to find that IPMI or serial doesn't
work. So let me make sure that the machine is still good to go:
```
pim@summer:~$ ssh -L 8443:defra0-ipmi:443 cons0.defra0
pim@cons0-defra0:~$ ipmitool -I lanplus -H defra0-ipmi -U ${IPMI_USER} -P ${IPMI_PASS} sol activate
[SOL Session operational. Use ~? for help]
defra0 login:
```
Nice going! Checking the samba configuration, it is super straightforward:
```
pim@cons0-defra0:~$ cat /etc/samba/smb.conf
[global]
workgroup = WINSHARE
server string = Ubuntu Samba %v
netbios name = console
security = user
map to guest = bad user
dns proxy = no
server min protocol = NT1
#============================ Share Definitions ==============================
[share]
path = /var/samba
browsable = yes
writable = no
guest ok = yes
read only = yes
pim@cons0-defra0:/var/samba$ ls -lrt
total 2306000
-rw-r--r-- 1 pim pim 441450496 Feb 10 2021 danos-2012-base-amd64.iso
-rw-r--r-- 1 pim pim 1261371392 Aug 24 2021 ubuntu-20.04.3-live-server-amd64.iso
-rw-r--r-- 1 pim pim 658505728 Dec 17 17:20 debian-12.4.0-amd64-netinst.iso
pim@cons0-defra0:~$ ip -br a
internal UP 172.16.13.1/24 fd25:8c03:9b1c:100d::1/64 fe80::b49b:1cff:feb2:7f2f/64
external UP 46.20.246.50/29 2a02:2528:ff01::2/64 fe80::d8fe:8ff:fe73:8c99/64
wlp4s0 UP 172.16.14.1/24 fd25:8c03:9b1c:100e::1/64 fe80::6f0:21ff:fe9b:562e/64
```
You can see the lifecycle progression on this server. In Feb'21, I installed DANOS 20.12, then
moving to Ubuntu LTS 20.04 around Aug'21, and now it is time to advance once again, this time to
Debian 12.
As a final pre-flight check, while using the port forwarding I set up (`-L` flag above), I will log
in to the IPMI controller remotely, to insert this CD image into the virtual CDROM drive, like so:
{{< image src="/assets/debian-vpp/supermicro-ipmi.png" alt="IPMI" >}}
And indeed, it pops up in the running Ubuntu router:
```
pim@defra0:~$ uname -a
Linux defra0 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 GNU/Linux
pim@defra0:~$ uptime
15:51:10 up 600 days, 17:40, 1 user, load average: 3.44, 3.30, 3.31
pim@defra0:~$ dmesg | tail -10
[51852396.194030] usb 2-4.2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[51852396.215804] usb-storage 2-4.2:1.0: USB Mass Storage device detected
[51852396.215993] scsi host6: usb-storage 2-4.2:1.0
[51852396.216107] usbcore: registered new interface driver usb-storage
[51852396.219915] usbcore: registered new interface driver uas
[51852396.232081] scsi 6:0:0:0: CD-ROM ATEN Virtual CDROM YS0J PQ: 0 ANSI: 0 CCS
[51852396.232475] scsi 6:0:0:0: Attached scsi generic sg1 type 5
[51852396.251038] sr 6:0:0:0: [sr0] scsi3-mmc drive: 40x/40x cd/rw xa/form2 cdda tray
[51852396.251047] cdrom: Uniform CD-ROM driver Revision: 3.20
[51852396.267643] sr 6:0:0:0: Attached scsi CD-ROM sr0
```
I just love it when this stuff works. And it's nice to see the happenstance of the machine being up
for 600 days. Good power, great operating system and awesome hosting provider. Thanks for the service
so far, my sweet little Ubuntu router ❤️ !
## Installing
### Drain
Considering there is live traffic on the network, typically what an operator would do is drain the
links to route around the maintenance. To do this in my case, I need to make two changes, notably
draining OSPF and eBGP.
***OSPF***: In AS8298, all backbone connections use OSPF, and typically traffic from Zurich to Amsterdam
will be over Frankfurt because the OSPF cost is slightly lower than the other way around. I've
decided to standardize the OSPF link cost to be in tenths of milliseconds. In other words, if the
latency from `chrma0` to `defra0` is 5.6 ms, the OSPF cost will be 56. One way for me to avoid using
the Frankfurt router is to make the cost of all traffic in and out of the router synthetically
high. I do this by adding +1000 to the OSPF cost.
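That convention can be computed mechanically; a small sketch, where the RTT value and the +1000 drain offset mirror the example in the text:

```shell
# Sketch: OSPF cost = latency in tenths of milliseconds (5.6 ms -> 56),
# and draining adds +1000. The RTT here is the chrma0-defra0 example from
# the text; in practice you'd measure it, e.g. with ping.
rtt_ms=5.6
cost=$(awk -v r="$rtt_ms" 'BEGIN { printf "%d", r * 10 }')
echo "normal cost: $cost, drained cost: $((cost + 1000))"
# prints: normal cost: 56, drained cost: 1056
```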
***BGP***: But there are also a bunch of internet exchanges (such as Kleyrex, DE-CIX and LoCIX), and two IP
transit upstreams (IP-Max and Meerfarbig) connected to this router in Frankfurt. I do not want
them to send IPng any traffic here during the maintenance, so I will drain eBGP as well by setting the
groups to _shutdown_ state in Kees.
```
pim@squanchy:~/src/ipng-kees$ git diff
diff --git a/config/defra0.ipng.ch.yaml b/config/defra0.ipng.ch.yaml
index 869058c..105630c 100644
--- a/config/defra0.ipng.ch.yaml
+++ b/config/defra0.ipng.ch.yaml
@@ -151,12 +151,13 @@ vppcfg:
ospf:
xe1-0.304:
description: chrma0
- cost: 56
+ cost: 1056
xe1-1.302:
description: nlams0
- cost: 61
+ cost: 1061
ebgp:
+ shutdown: true
groups:
decix_dus:
local-addresses: [ 185.1.171.43/23, 2001:7f8:9e::206a:0:1/64 ]
```
By raising the OSPF cost, the network will route around the machine that I want to play with:
```
pim@squanchy:~/src/ipng-kees$ traceroute nlams0.ipng.ch
traceroute to nlams0.ipng.ch (194.1.163.32), 64 hops max, 40 byte packets
1 chbtl0 (194.1.163.66) 0.492 ms 0.64 ms 0.615 ms
2 chrma0 (194.1.163.17) 1.268 ms 1.196 ms 1.194 ms
3 chplo0 (194.1.163.51) 5.682 ms 5.514 ms 5.603 ms
4 frpar0 (194.1.163.40) 14.481 ms 14.605 ms 14.58 ms
5 frggh0 (194.1.163.30) 19.545 ms 18.61 ms 18.684 ms
6 nlams0 (194.1.163.32) 47.613 ms 47.765 ms 47.584 ms
```
And by setting the sessions to _shutdown_, Kees will make it regenerate all of the BGP sessions
with an `export none` and a low `bgp_local_pref`, which will make the router itself stop announcing
any prefixes, for example this session in Düsseldorf:
```
@@ -25,11 +25,11 @@ protocol bgp decix_dus_56890_ipv4_1 {
source address 185.1.171.43;
neighbor 185.1.170.252 as 56890;
default bgp_med 0;
- default bgp_local_pref 200;
+ default bgp_local_pref 0; # shutdown
ipv4 {
import keep filtered;
import filter ebgp_decix_dus_56890_import;
- export filter ebgp_decix_dus_56890_export;
+ export none; # shutdown
receive limit 250000 action restart;
next hop self on;
};
```
{{< image width="80px" float="left" src="/assets/debian-vpp/warning.png" alt="Warning" >}}
This is where it's a good idea to grab some tea. Quite a few internet providers have
incredibly slow convergence, so merely stopping the announcement of `AS8298:AS-IPNG` prefixes at
this internet exchange doesn't mean things get updated quickly. It makes sense to wait a few
minutes (by default I wait 15min) so that every router that might be a slow-poke (I'm looking at
you, Juniper!) has time to update their RIB and FIB.
VPP itself pretty immediately flips all of its paths to other places, and it converges a full table
of 950K IPv4 and 195K IPv6 routes in about 7 seconds or so, but not everybody has such fast CPUs in
their vendor-silicon-fancypants-router :-)
### Upgrade
The tea in my advents calendar for December 17th is _Whittard's Lemon & Ginger_ infusion, and it is
delicious. What could possibly go wrong?! Now that the router is fully drained, I start a ping to
the loopback, and flip the virtual powerswitch on the IPMI console. A few seconds later, the machine
expectedly stops pinging and ... the world doesn't end, my SSH session to a hypervisor in Amsterdam
is still alive, and most importantly, Spotify is still playing music:
```
pim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch
PING defra0.ipng.ch (194.1.163.7): 56 data bytes
64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms
64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms
64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms
...
```
I open the IPMI KVM console, hit F10 and select the CDROM option, which has my previously inserted
Debian 12 _netinst_ ISO:
{{< image src="/assets/debian-vpp/debian-ipmi.png" alt="Debian on IPMI" >}}
At this point I can't help but smile. I'm sitting here in Brüttisellen, roughly 400km south of
this computer in Frankfurt, and I am looking at the VGA output of a fresh Debian installer. Come on,
you have to admit, that's pretty slick! Installing Debian follows pretty precisely my previous VPP#7
article [[ref]({% post_url 2021-09-21-vpp-7 %})]. I go through the installer options and a few
minutes later, it's mission accomplished. I give the router its IPv4/IPv6 address in _IPng Site
Local_, so that it has management network connectivity, and just before it wants to reboot, I
quickly edit `/etc/default/grub` to turn on serial output, just like in the article:
```
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7"
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --stop=1 --parity=no --word=8"
```
As the machine reboots, I eject the CDROM from the IPMI web interface, and attach to the
serial-over-lan interface instead. Booyah, it boots!
### Configure
On my workstation, I mount yesterday's Borg backup for the machine, because instead of doing the
whole router build over from scratch, I'm going to selectively copy a few bits and pieces over, in
the interest of time. Also, it's nice to actually use borgbackup for once, although Fred and I have
made grateful use of it in an emergency when one of IP-Max's hypervisors failed in Geneva.
```
pim@summer:~$ sudo borg mount ssh://${BORG_REPO}/defra0.ipng.ch/ /var/borgbackup/
Enter passphrase for key ssh://${BORG_REPO}/defra0.ipng.ch:
pim@summer:~$ sudo ls -l /var/borgbackup/defra0-2023-12-17T01:45:47.983599
bin boot cdrom etc home lib lib32 lib64 libx32 lost+found media mnt opt
root sbin srv tmp usr var
```
In case you're wondering why I mount the backup as root, it's because that way I can guarantee all
the correct users/permissions etc are present in the restore. I've done a practice run of the
upgrade yesterday, at `chplo0.ipng.ch`, so by now I have a pretty good handle on what needs
to happen. While connected to the freshly installed Debian Bookworm machine via serial-over-LAN,
here's what I do:
```
root@defra0:~# apt install sudo rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 \
lm-sensors netplan.io build-essential borgmatic unbound tcpdump \
libnl-3-200 libnl-route-3-200
root@defra0:~# adduser pim sudo
root@defra0:~# adduser pim bird
root@defra0:~# systemctl stop bird; systemctl disable bird; systemctl mask bird
root@defra0:~# sensors-detect --auto
root@defra0:~# export REPO=summer.net.ipng.ch:/var/borgbackup/defra0-2023-12-17T01:45:47.983599
root@defra0:~# mv /etc/network/interfaces /etc/network/interfaces.orig
root@defra0:~# rsync -avugP $REPO/etc/netplan/ /etc/netplan/
root@defra0:~# rm -f /etc/ssh/ssh_host*
root@defra0:~# rsync -avugP $REPO/etc/ssh/ssh_host* /etc/ssh/
root@defra0:~# rsync -avugP $REPO/etc/sysctl.d/80* /etc/sysctl.d/
root@defra0:~# rsync -avugP $REPO/etc/bird/ /etc/bird/
root@defra0:~# rsync -avugP $REPO/etc/vpp/ /etc/vpp/
root@defra0:~# rsync -avugP $REPO/etc/borgmatic/ /etc/borgmatic/
root@defra0:~# rsync -avugP $REPO/etc/rc.local /etc/rc.local
root@defra0:~# rsync -avugP $REPO/lib/systemd/system/*dataplane* /lib/systemd/system
```
I decide to selectively copy only the specific configuration files necessary to boot the dataplane.
This means the systemd services (like snmpd, sshd, and their network namespace), and all the Bird
and VPP config files. Because I prefer not to have to clear the SSH host keys, I also copy the old
SSH host keys over. And considering IPng Networks standardizes on netplan for interface config, I'll
move the Debian-default `interfaces` out of the way.
Finally, I add a few finishing touches and reboot one last time to ensure things are settled:
```
root@defra0:~# cat << EOF | tee -a /etc/modules
coretemp
mpls_router
vfio_pci
EOF
root@defra0:~# update-initramfs -k all -u
root@defra0:~# update-grub
root@defra0:~# mkdir -p /etc/systemd/system/unbound.service.d/
root@defra0:~# mkdir -p /etc/systemd/system/snmpd.service.d/
root@defra0:~# cat << EOF | tee /etc/systemd/system/unbound.service.d/override.conf
[Service]
NetworkNamespacePath=/var/run/netns/dataplane
EOF
root@defra0:~# cp /etc/systemd/system/unbound.service.d/override.conf \
/etc/systemd/system/snmpd.service.d/override.conf
root@defra0:~# reboot
```
The machine once again comes up, and now it's loaded the VFIO and MPLS kernel modules, so I'm ready
for the grand finale, which is installing VPP at the same version as the other routers in the fleet:
```
root@defra0:~# mkdir -p /var/log/vpp/
root@defra0:~# wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/
root@defra0:~# dpkg -i ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/*.deb
root@defra0:~# adduser pim vpp
root@defra0:~# vppctl show version
vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
```
In the corner of my eye, I see one of my xterms move. Hah! It's the ping I left running on
squanchy before, check it out:
```
pim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch
PING defra0.ipng.ch (194.1.163.7): 56 data bytes
64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms
64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms
64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms
...
64 bytes from 194.1.163.7: icmp_seq=1484 ttl=62 time=6.5 ms
64 bytes from 194.1.163.7: icmp_seq=1485 ttl=62 time=6.6 ms
64 bytes from 194.1.163.7: icmp_seq=1486 ttl=62 time=6.8 ms
```
One think-o I made is that the Bird configs that I just put back from the backup were those from before I
set the drains (remember, raising the OSPF cost and setting the EBGP sessions to _shutdown_) so they
are now all alive again. But it's all good - the dataplane came up, Bird2 came up and formed OSPF
and OSPFv3 adjacencies a few seconds later, and BGP sessions all shot to life. I take a quick look
at the state of the dataplane to make sure I'm not accidentally introducing a broken router:
```
pim@defra0:~$ birdc show route count
BIRD 2.0.12 ready.
6782372 of 6782372 routes for 958020 networks in table master4
1848350 of 1848350 routes for 198255 networks in table master6
1620753 of 1620753 routes for 405189 networks in table t_roa4
367875 of 367875 routes for 91969 networks in table t_roa6
Total: 10619350 of 10619350 routes for 1653433 networks in 4 tables
pim@defra0:~$ vppctl show ip fib summary | awk '{ TOTAL += $2 } END { print TOTAL }'
958664
pim@defra0:~$ vppctl show ip6 fib summary | awk '{ TOTAL += $2 } END { print TOTAL }'
198322
```
OK, looking at the output I can conclude that my think-o was benign and the router has all routes
accounted for in the RIB, it has slurped in the RPKI tables, and it has successfully transferred all
of this into VPP's FIB. So this entire upgrade took 1482 seconds, which is just under 25 minutes.
***Gnarly!***
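As an aside, the RIB-vs-FIB cross-check is easy to script. Here's a small Python sketch of the
parsing half -- a helper mirroring the `awk` one-liner above, fed with made-up sample output (the
helper name and the sample text are mine, not real router output):

```python
def fib_route_total(summary: str) -> int:
    """Sum the second column of `vppctl show ip fib summary` output,
    mirroring: awk '{ TOTAL += $2 } END { print TOTAL }'."""
    total = 0
    for line in summary.splitlines():
        fields = line.split()
        # Header and table-name lines have a non-numeric second field; skip them.
        if len(fields) >= 2 and fields[1].isdigit():
            total += int(fields[1])
    return total

# Illustrative (fabricated) summary output:
sample = """ipv4-VRF:0, fib_index:0
    Prefix length         Count
                   0               1
                   8              16
                  24             500
                  32             147
"""
print(fib_route_total(sample))  # 664
```

Comparing that number against `birdc show route count` then becomes a one-line assertion in a
monitoring script.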
### Post Install
The machine is up and running, and there's one last thing for me to do, which is perform an Ansible
run to make sure that the whole machine is configured correctly (for example, the correct access
list for _Unbound_, the correct IPv4/IPv6 firewall for the Linux controlplane, the correct SSH
daemon options, working mailer and NTP daemon, et cetera).
So I fire off a one-shot Ansible playbook run, and it pokes and prods the machine a bit:
{{< image src="/assets/debian-vpp/ansible.png" alt="Ansible" >}}
Now the machine is completely up-to-snuff: its latest VPP SNMP agent, Prometheus exporter, Bird
exporter, and so on are all good. I check LibreNMS and indeed, the machine is back with a half an
hour or so of monitoring data missing. I'm still grinning as I write this, as most Juniper and Cisco
_firmware_ upgrades take more than 30min, while for me the whole thing from start to finish was less
than that.
## Results
{{< image src="/assets/debian-vpp/smokeping.png" alt="Smokeping" >}}
This article describes how I managed to upgrade the entire network of routers, remotely, from the
comfort of my home, while sipping tea, and without having a single network outage. The bump in the
graph is the moment at which I drained `defra0` and traffic from the monitoring machine at `nlams0`
had to go via France to my house at `chbtl0`. No packets were lost in the making of this upgrade!
Yesterday I practiced on `chplo0`, and today for this article I did `defra0`, after which I also
did the last remaining router `nlams0`. Every router is now up to date running Debian Bookworm as
well as VPP version 24.02 (including a bunch of desirable fixes for IPFIX/Flowprobe):
```
pim@squanchy:~/src/ipng-kees$ ./doall.sh 'echo -n $(hostname -s):\ ; vppctl show version'
chbtl0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
chbtl1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
chgtg0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
chplo0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
chrma0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
ddln0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
ddln1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
defra0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
frggh0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
frpar0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
nlams0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
usfmt0: vpp v24.02-rc0~175-g31d4891cf built by pim on bullseye-builder at 2023-12-09T16:27:33
```
For the hawk-eyed: yes, `usfmt0` has not been done. I don't have a Supermicro with IPMI there, so the
next time I visit California, I'll make a stop at the local Hurricane Electric datacenter to upgrade
that last one :-)
---
date: "2024-01-27T10:01:14Z"
title: VPP Python API
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
You'll hear me talk about VPP being API centric, with no configuration persistence, and that's by
design. However, there is also a CLI utility called `vppctl`, right, so what gives? In truth,
the CLI is used a lot by folks to configure their dataplane, but it really was always meant to be
a debug utility. There's a whole wealth of programmability that is _not_ exposed via the CLI at all,
and the VPP community develops and maintains an elaborate set of tools to allow external programs
to (re)configure the dataplane. One such tool is my own [[vppcfg]({% post_url 2022-04-02-vppcfg-2
%})] which takes a YAML specification that describes the dataplane configuration, and applies it
safely to a running VPP instance.
## Introduction
In case you're interested in writing your own automation, this article is for you! I'll provide a
deep dive into the Python API which ships with VPP. It's actually very easy to use once you get used
to it -- assuming you know a little bit of Python of course :)
### VPP API: Anatomy
When developers write their VPP features, they'll add an API definition file that describes
control-plane messages that are typically called via a shared memory interface, which explains why
these things are called _memclnt_ in VPP. Certain API _types_ can be created, resembling their
underlying C structures, and these types are passed along in _messages_. Finally a _service_ is a
**Request/Reply** pair of messages. When requests are received, VPP executes a _handler_ whose job it
is to parse the request and send either a singular reply, or a _stream_ of replies (like a list of
interfaces).
Clients connect to a unix domain socket, typically `/run/vpp/api.sock`. A TCP port can also
be used, with the caveat that there is no access control provided. Messages are exchanged over this
channel _asynchronously_. A common pattern of async API design is to have a client identifier
(called a _client_index_) and some random number (called a _context_) with which the client
identifies their request. Using these two things, VPP will issue a callback using (a) the
_client_index_ to send the reply to and (b) the client knows which _context_ the reply is meant for.
By the way, this _asynchronous_ design pattern gives programmers one really cool benefit out of the
box: events that are not explicitly requested, like say, link-state change on an interface, can now
be implemented by simply registering a standing callback for a certain message type - I'll show how
that works at the end of this article. As a result, any number of clients, their requests and even
arbitrary VPP initiated events can be in flight at the same time, which is pretty slick!
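To make that pattern concrete, here is a toy, VPP-agnostic sketch of context correlation. The
`Correlator` class is purely illustrative -- `vpp_papi` does this bookkeeping for you under the hood:

```python
import itertools

class Correlator:
    """Toy model of the async API pattern: every request is tagged with a
    fresh context number, and replies are matched back by that number, so
    any number of requests can be in flight at the same time."""

    def __init__(self):
        self._contexts = itertools.count(1)
        self._pending = {}  # context -> request payload

    def send(self, request):
        ctx = next(self._contexts)
        self._pending[ctx] = request
        return ctx  # the wire message would carry this context

    def on_reply(self, ctx, reply):
        request = self._pending.pop(ctx)  # KeyError on an unknown context
        return request, reply

c = Correlator()
ctx_a = c.send("show_version")
ctx_b = c.send("sw_interface_dump")
# Replies may arrive in any order; the context pairs them back up:
print(c.on_reply(ctx_b, "details..."))  # ('sw_interface_dump', 'details...')
print(c.on_reply(ctx_a, "v24.02"))      # ('show_version', 'v24.02')
```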
#### API Types
Most API requests pass along data structures, which follow their internal representation in VPP. I'll start
by taking a look at a simple example -- the VPE itself. It defines a few things in
`src/vpp/vpe_types.api`, notably a few type definitions and one enum:
```
typedef version
{
u32 major;
u32 minor;
u32 patch;
/* since we can't guarantee that only fixed length args will follow the typedef,
string type not supported for typedef for now. */
u8 pre_release[17]; /* 16 + "\0" */
u8 build_metadata[17]; /* 16 + "\0" */
};
typedef f64 timestamp;
typedef f64 timedelta;
enum log_level {
VPE_API_LOG_LEVEL_EMERG = 0, /* emerg */
VPE_API_LOG_LEVEL_ALERT = 1, /* alert */
VPE_API_LOG_LEVEL_CRIT = 2, /* crit */
VPE_API_LOG_LEVEL_ERR = 3, /* err */
VPE_API_LOG_LEVEL_WARNING = 4, /* warn */
VPE_API_LOG_LEVEL_NOTICE = 5, /* notice */
VPE_API_LOG_LEVEL_INFO = 6, /* info */
VPE_API_LOG_LEVEL_DEBUG = 7, /* debug */
VPE_API_LOG_LEVEL_DISABLED = 8, /* disabled */
};
```
By doing this, API requests and replies can start referring to these types. When reading this, it
feels a bit like a C header file, showing me the structure. For example, I know that if I ever need
to pass along an argument called `log_level`, I know which values I can provide, together with their
meaning.
#### API Messages
I now take a look at `src/vpp/api/vpe.api` itself, this is where the VPE API is defined. It includes
the former `vpe_types.api` file, so it can reference these typedefs and the enum. Here, I see a few
messages defined that constitute a **Request/Reply** pair:
```
define show_version
{
u32 client_index;
u32 context;
};
define show_version_reply
{
u32 context;
i32 retval;
string program[32];
string version[32];
string build_date[32];
string build_directory[256];
};
```
There's one small surprise here out of the gate. I would've expected that beautiful typedef called
`version` from the `vpe_types.api` file to make an appearance, but it's conspicuously missing from
the `show_version_reply` message. Ha! But the rest of it seems reasonably self-explanatory -- as I
already know about the _client_index_ and _context_ fields, I now know that this request does not
carry any arguments, and that the reply has a _retval_ for application errors, similar to how most
libC functions return 0 on success, and some negative value error number defined in
[[errno.h](https://en.wikipedia.org/wiki/Errno.h)]. Then, there are four strings of the given
length, which I should be able to consume.
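Because _retval_ follows that negative-number convention, a small guard helper is useful in client
code. This is my own sketch with a stand-in reply object; note that VPP's error space only loosely
maps onto libc's `errno`, so the `strerror()` text is just a hint:

```python
import os
from collections import namedtuple

def check_retval(reply):
    """Return the reply unchanged on success, raise OSError otherwise."""
    if reply.retval < 0:
        raise OSError(-reply.retval, os.strerror(-reply.retval))
    return reply

# Illustration with a stand-in reply object:
Reply = namedtuple("Reply", ["retval"])
check_retval(Reply(retval=0))  # passes through untouched
try:
    check_retval(Reply(retval=-1))
except OSError as e:
    print(e.errno)  # 1
```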
#### API Services
The VPP API defines three types of message exchanges:
1. **Request/Reply** - The client sends a request message and the server replies with a single reply
message. The convention is that the reply message is named as `method_name + _reply`.
1. **Dump/Detail** - The client sends a “bulk” request message to the server, and the server replies
with a set of detail messages. These messages may be of different type. The method name must end
with `method + _dump`, the reply message should be named `method + _details`. These Dump/Detail
methods are typically used for acquiring bulk information, like the complete FIB table.
1. **Events** - The client can register for getting asynchronous notifications from the server. This
is useful for getting interface state changes, and so on. The method name for requesting
notifications is conventionally prefixed with `want_`, for example `want_interface_events`.
If the convention is kept, the API machinery will correlate the `foo` and `foo_reply` messages into
RPC services. But it's also possible to be explicit about these, by defining _service_ scopes in the
`*.api` files. I'll take two examples, the first one is from the Linux Control Plane plugin (which
I've [[written about]({% post_url 2021-08-12-vpp-1 %})] a lot while I was contributing to it back in
2021).
**Dump/Detail (example)**: When enumerating _Linux Interface Pairs_, the service definition looks like
this:
```
service {
rpc lcp_itf_pair_get returns lcp_itf_pair_get_reply
stream lcp_itf_pair_details;
};
```
To puzzle this together, the request called `lcp_itf_pair_get` is paired up with a reply called
`lcp_itf_pair_get_reply` followed by a stream of zero-or-more `lcp_itf_pair_details` messages. Note
the use of the pattern _rpc_ X _returns_ Y _stream_ Z.
**Events (example)**: I also take a look at an event handler like the one in the interface API that
made an appearance in my list of API message types, above:
```
service {
rpc want_interface_events returns want_interface_events_reply
events sw_interface_event;
};
```
Here, the request is `want_interface_events` which returns a `want_interface_events_reply` followed
by zero or more `sw_interface_event` messages, which is very similar to the streaming (dump/detail)
pattern. The semantic difference is that streams are lists of things and events are asynchronously
happening things in the dataplane -- in other words the _stream_ is meant to end while the _events_
messages are generated by VPP when the event occurs. In this case, if an interface is created or
deleted, or the link state of an interface changes, one of these is sent from VPP to the client(s)
that registered an interest in it by calling the `want_interface_events` RPC.
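As a quick sketch of what a client-side event handler could look like: the field names in
`describe_event` below follow `sw_interface_event` from `interface.api`, while the registration
calls are shown only as comments because they require a connected `vpp` client object (constructed
later in this article). Treat it as an illustration, not canonical usage:

```python
from collections import namedtuple

def describe_event(event):
    """Render an sw_interface_event message (fields per interface.api)."""
    state = "deleted" if event.deleted else f"flags={int(event.flags)}"
    return f"interface[{event.sw_if_index}] {state}"

# Against a live VPP, registration would look roughly like:
#   vpp.register_event_callback(lambda name, msg: print(describe_event(msg)))
#   vpp.api.want_interface_events(enable_disable=True, pid=os.getpid())

# Stand-in message object, just to exercise the formatter:
Event = namedtuple("Event", ["sw_if_index", "flags", "deleted"])
print(describe_event(Event(1, 3, False)))  # interface[1] flags=3
print(describe_event(Event(2, 0, True)))   # interface[2] deleted
```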
#### JSON Representation
VPP comes with an internal API compiler that scans the source code for these `*.api` files and
assembles them into a few output formats. I take a look at the Python implementation of it in
`src/tools/vppapigen/` and see that it generates C, Go and JSON. As an aside, I chuckle a little bit
at a Python script generating Go and C, but I quickly get over myself. I'm not that funny.
The `vppapigen` tool outputs a bunch of JSON files, one per API specification, which wraps up all of
the information from the _types_, _unions_ and _enums_, the _message_ and _service_ definitions,
together with a few other bits and bobs, and when VPP is installed, these end up in
`/usr/share/vpp/api/`. As of the upcoming VPP 24.02 release, there's about 50 of these _core_ APIs
and an additional 80 or so APIs defined by _plugins_ like the Linux Control Plane.
Implementing APIs is pretty user friendly, largely due to the `vppapigen` tool taking so much of the
boilerplate and autogenerating things. As an example, I need to be able to enumerate the interfaces
that are MPLS enabled, so that I can use my [[vppcfg]({% post_url 2022-03-27-vppcfg-1 %})] utility to
configure MPLS. I contributed an API called `mpls_interface_dump` which returns a
stream of `mpls_interface_details` messages. You can see that small contribution in merged [[Gerrit
39022](https://gerrit.fd.io/r/c/vpp/+/39022)].
## VPP Python API
The VPP API has been ported to many languages (C, C++, Go, Lua, Rust, Python, probably a few others).
I am primarily a user of the Python API, which ships alongside VPP in a separate Debian package. The
source code lives in `src/vpp-api/python/` which doesn't have any dependencies other than Python's
own `setuptools`. Its implementation is canonically called `vpp_papi`, which, I cannot tell a lie,
reminds me of Spanish rap music. But, if you're still reading, maybe now is a good time to depart
from the fundamental, and get to the practical!
### Example: Hello World
Without further ado, I dive right in with this tiny program:
```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("ipng-client")
api_reply = vpp.api.show_version()
print(api_reply)
```
The first thing this program does is construct a so-called `VPPApiClient` object. To do this, I need
to feed it a list of JSON definitions, so that it knows what types of APIs are available.
As I mentioned above, those definitions are spread over a number of JSON files. I could assemble
that list of files myself, but there are two handy helpers here:
1. **find_api_dir()** - This is a helper that finds the location of the API files. Normally, the JSON
files get installed in `/usr/share/vpp/api/`, but when I'm writing code, it's more likely that the
files are in `/home/pim/src/vpp/` somewhere. This helper function tries to do the right thing and
detect if I'm in a source checkout or if I'm using a production install, and will return the correct
directory.
1. **find_api_files()** - Now, I could rummage through that directory and find the JSON files, but
there's another handy helper that does that for me, given a directory (like the one I just got
handed to me). Life is easy.
Once I have the JSON files in hand, I can construct a client by specifying the _server_address_
location to connect to -- this is typically a unix domain socket in `/run/vpp/api.sock` but it can
also be a TCP endpoint. As a quick aside: If you, like me, stumbled over the socket being owned by
`root:vpp` but not writable by the group, that finally got fixed by Georgy in
[[Gerrit 39862](https://gerrit.fd.io/r/c/vpp/+/39862)].
Once I'm connected, I can start calling arbitrary API methods, like `show_version()` which does not
take any arguments. Its reply is a named tuple, and it looks like this:
```bash
pim@vpp0-0:~/vpp_papi_examples$ ./00-version.py
show_version_reply(_0=1415, context=1,
retval=0, program='vpe', version='24.02-rc0~46-ga16463610',
build_date='2023-10-15T14:50:49', build_directory='/home/pim/src/vpp')
```
And here is my beautiful hello world in seven (!) lines of code. All that reading and preparing
finally starts paying off. Neat-oh!
### Example: Listing Interfaces
From here on out, it's just incremental learning. Here's an example of how to extend the hello world
example above and make it list the dataplane interfaces and their IPv4/IPv6 addresses:
```python
api_reply = vpp.api.sw_interface_dump()
for iface in api_reply:
str = f"[{iface.sw_if_index}] {iface.interface_name}"
ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=False)
for addr in ipr:
str += f" {addr.prefix}"
ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=True)
for addr in ipr:
str += f" {addr.prefix}"
print(str)
```
The API method `sw_interface_dump()` can take a few optional arguments. Notably, if `sw_if_index` is
set, the call will dump that exact interface. If it's not set, it will default to -1 which will dump
all interfaces, and this is how I use it here. For completeness, the method also has an optional
string `name_filter`, which will dump all interfaces whose name contains a given substring. For example,
passing `name_filter='loop'` and `name_filter_valid=True` as arguments would enumerate all interfaces
that have the word 'loop' in them.
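The same filtering is trivial to do client-side as well, if you'd rather dump everything once and
sift locally. A small sketch with stand-in records (`filter_by_name` is my own helper, not a
`vpp_papi` call):

```python
from collections import namedtuple

def filter_by_name(details, substring):
    """Client-side equivalent of name_filter: keep the details whose
    interface_name contains the given substring."""
    return [d for d in details if substring in d.interface_name]

# Stand-in records shaped like sw_interface_details messages:
Detail = namedtuple("Detail", ["sw_if_index", "interface_name"])
ifaces = [Detail(1, "GigabitEthernet10/0/0"), Detail(5, "loop0")]
print([d.interface_name for d in filter_by_name(ifaces, "loop")])  # ['loop0']
```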
Now, the definition of the `sw_interface_dump` method suggests that it returns a stream (remember
the **Dump/Detail** pattern above), so I can predict that the messages I will receive are of type
`sw_interface_details`. There's *lots* of cool information in here, like the MAC address, MTU,
encapsulation (if this is a sub-interface), but for now I'll only make note of the `sw_if_index` and
`interface_name`.
Using this interface index, I then call the `ip_address_dump()` method, which looks like this:
```
define ip_address_dump
{
u32 client_index;
u32 context;
vl_api_interface_index_t sw_if_index;
bool is_ipv6;
};
define ip_address_details
{
u32 context;
vl_api_interface_index_t sw_if_index;
vl_api_address_with_prefix_t prefix;
};
```
Alright then! If I want the IPv4 addresses for a given interface (referred to not by its name, but
by its index), I can call it with argument `is_ipv6=False`. The return is zero or more messages that
contain the index again, and a prefix the precise type of which can be looked up in `ip_types.api`.
After doing a form of layer-one traceroute through the API specification files, it turns out that
this prefix is cast to an instance of the `IPv4Interface()` class in Python. I won't bore you with
it, but the second call sets `is_ipv6=True` and, unsurprisingly, returns a bunch of
`IPv6Interface()` objects.
To put it all together, the output of my little script:
```
pim@vpp0-0:~/vpp_papi_examples$ ./01-interface.py
VPP version is 24.02-rc0~46-ga16463610
[0] local0
[1] GigabitEthernet10/0/0 192.168.10.5/31 2001:678:d78:201::fffe/112
[2] GigabitEthernet10/0/1 192.168.10.6/31 2001:678:d78:201::1:0/112
[3] GigabitEthernet10/0/2
[4] GigabitEthernet10/0/3
[5] loop0 192.168.10.0/32 2001:678:d78:200::/128
```
### Example: Linux Control Plane
Normally, services are either of a **Request/Reply** or a **Dump/Detail** type. But careful readers may
have noticed that the Linux Control Plane does a little bit of both. It has a **Request/Reply/Detail**
triplet, because for request `lcp_itf_pair_get`, it will return a `lcp_itf_pair_get_reply` AND a
_stream_ of `lcp_itf_pair_details`. Perhaps in hindsight a more idiomatic way to do this would have
been to simply create a `lcp_itf_pair_dump`, but considering this is what we ended up with, I can use it as
a good example case -- how might I handle such a response?
```python
api_reply = vpp.api.lcp_itf_pair_get()
if isinstance(api_reply, tuple) and api_reply[0].retval == 0:
for lcp in api_reply[1]:
str = f"[{lcp.vif_index}] {lcp.host_if_name}"
api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.host_sw_if_index)
tap_iface = api_reply2[0]
api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.phy_sw_if_index)
phy_iface = api_reply2[0]
str += f" tap {tap_iface.interface_name} phy {phy_iface.interface_name} mtu {phy_iface.link_mtu}"
print(str)
```
This particular API first sends its _reply_ and then its _stream_, so I can expect it to be a tuple
with the first element being a namedtuple and the second element being a list of details messages. A
good way to ensure that is to check for the reply's _retval_ field to be 0 (success) before trying
to enumerate the _Linux Interface Pairs_. These consist of a VPP interface (say
`GigabitEthernet10/0/0`), which corresponds to a TUN/TAP device which in turn has a VPP name (eg
`tap1`) and a Linux name (eg. `e0`).
The Linux Control Plane call will return these dataplane objects as numerical interface indexes,
not names. However, I can resolve them to names by calling the `sw_interface_dump()` method and
specifying the index as an argument. Because this is a **Dump/Detail** type API call, the return will
be a _stream_ (a list), which will have either zero (if the index didn't exist), or one element
(if it did).
Using this I can puzzle together the following output:
```
pim@vpp0-0:~/vpp_papi_examples$ ./02-lcp.py
VPP version is 24.02-rc0~46-ga16463610
[2] loop0 tap tap0 phy loop0 mtu 9000
[3] e0 tap tap1 phy GigabitEthernet10/0/0 mtu 9000
[4] e1 tap tap2 phy GigabitEthernet10/0/1 mtu 9000
[5] e2 tap tap3 phy GigabitEthernet10/0/2 mtu 9000
[6] e3 tap tap4 phy GigabitEthernet10/0/3 mtu 9000
```
### VPP's Python API objects
The objects in the VPP dataplane can be arbitrarily complex. They can have nested objects, enums,
unions, repeated fields and so on. To illustrate a more complete example, I will take a look at an
MPLS tunnel object in the dataplane. I first create the MPLS tunnel using the CLI, as follows:
```
vpp# mpls tunnel l2-only via 192.168.10.3 GigabitEthernet10/0/1 out-labels 8298 100 200
vpp# mpls local-label add 8298 eos via l2-input-on mpls-tunnel0
```
The first command creates an interface called `mpls-tunnel0` which, if it receives an ethernet frame, will
encapsulate it into an MPLS datagram with a labelstack of `8298.100.200`, and then forward it on to
the router at 192.168.10.3. The second command adds a FIB entry to the MPLS table, upon receipt of a
datagram with the label `8298`, unwrap it and present the resulting datagram contents as an ethernet
frame into `mpls-tunnel0`. By cross connecting this MPLS tunnel with any other dataplane interface
(for example, `HundredGigabitEthernet10/0/1.1234`), this would be an elegant way to configure a
classic L2VPN ethernet-over-MPLS transport. Which is hella cool, but I digress :)
I want to inspect this tunnel using the API, and I find an `mpls_tunnel_dump()` method. As we
know well by now, this is a **Dump/Detail** type method, so the return value will be a list of
zero or more `mpls_tunnel_details` messages.
The `mpls_tunnel_details` message is simply a wrapper around an `mpls_tunnel` type as can be seen in
`mpls.api`, and it references the `fib_path` type as well. Here they are:
```
typedef fib_path
{
u32 sw_if_index;
u32 table_id;
u32 rpf_id;
u8 weight;
u8 preference;
vl_api_fib_path_type_t type;
vl_api_fib_path_flags_t flags;
vl_api_fib_path_nh_proto_t proto;
vl_api_fib_path_nh_t nh;
u8 n_labels;
vl_api_fib_mpls_label_t label_stack[16];
};
typedef mpls_tunnel
{
vl_api_interface_index_t mt_sw_if_index;
u32 mt_tunnel_index;
bool mt_l2_only;
bool mt_is_multicast;
string mt_tag[64];
u8 mt_n_paths;
vl_api_fib_path_t mt_paths[mt_n_paths];
};
define mpls_tunnel_details
{
u32 context;
vl_api_mpls_tunnel_t mt_tunnel;
};
```
Taking a closer look, the `mpls_tunnel` type consists of an interface index, then an `mt_tunnel_index`
which corresponds to the tunnel number (ie. interface mpls-tunnel**N**), some boolean flags, and
then a vector of N FIB paths. Incidentally, you'll find FIB paths all over the place in VPP: in
routes, tunnels like this one, ACLs, and so on, so it's good to get to know them a bit.
Remember when I created the tunnel, I specified something like `.. via ..`? That's a tell-tale
sign that what follows is a FIB path. I specified a nexthop (192.168.10.3 on GigabitEthernet10/0/1) and
a list of three out-labels (8298, 100 and 200), all of which VPP has tucked away in this
`mt_paths` field.
Although it's a bit verbose, I'll paste the complete object for this tunnel, including the FIB path.
You know, for science:
```
mpls_tunnel_details(_0=1185, context=5,
mt_tunnel=vl_api_mpls_tunnel_t(
mt_sw_if_index=17,
mt_tunnel_index=0,
mt_l2_only=True,
mt_is_multicast=False,
mt_tag='',
mt_n_paths=1,
mt_paths=[
vl_api_fib_path_t(sw_if_index=2, table_id=0,rpf_id=0, weight=1, preference=0,
type=<vl_api_fib_path_type_t.FIB_API_PATH_TYPE_NORMAL: 0>,
flags=<vl_api_fib_path_flags_t.FIB_API_PATH_FLAG_NONE: 0>,
proto=<vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP4: 0>,
nh=vl_api_fib_path_nh_t(
address=vl_api_address_union_t(
ip4=IPv4Address('192.168.10.3'), ip6=IPv6Address('c0a8:a03::')),
via_label=0, obj_id=0, classify_table_index=0),
n_labels=3,
label_stack=[
vl_api_fib_mpls_label_t(is_uniform=0, label=8298, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=100, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=200, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0),
vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0)
]
)
]
)
)
```
This `mt_paths` is really interesting, and I'd like to make a few observations:
* **type**, **flags** and **proto** are ENUMs which I can find in `fib_types.api`
* **nh** is the nexthop - there is only one nexthop specified per path entry, so when things like
ECMP multipath are in play, this will be a vector of N _paths_ each with one _nh_. Good to know.
This nexthop specifies an address which is a _union_ just like in C. It can be either an
_ip4_ or an _ip6_. I will know which to choose due to the _proto_ field above.
* **n_labels** and **label_stack**: The MPLS label stack has a fixed size. VPP reveals here (in
the API definition but also in the response) that the label-stack can be at most 16 labels deep.
I feel like this is an interview question at Cisco, somehow. I know how many labels are relevant
because of the _n_labels_ field above. Their type is of `fib_mpls_label` which can be found in
`mpls.api`.
After having consumed all of this, I am ready to write a program that wheels over these message
types and prints something a little bit more compact. The final program, in all of its glory --
```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum
def format_path(path):
str = ""
if path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP4:
str += f" ipv4 via {path.nh.address.ip4}"
elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP6:
str += f" ipv6 via {path.nh.address.ip6}"
elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_MPLS:
str += f" mpls"
elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_ETHERNET:
api_reply2 = vpp.api.sw_interface_dump(sw_if_index=path.sw_if_index)
iface = api_reply2[0]
str += f" ethernet to {iface.interface_name}"
else:
print(path)
if path.n_labels > 0:
str += " label"
for i in range(path.n_labels):
str += f" {path.label_stack[i].label}"
return str
api_reply = vpp.api.mpls_tunnel_dump()
for tunnel in api_reply:
str = f"Tunnel [{tunnel.mt_tunnel.mt_sw_if_index}] mpls-tunnel{tunnel.mt_tunnel.mt_tunnel_index}"
for path in tunnel.mt_tunnel.mt_paths:
str += format_path(path)
print(str)
api_reply = vpp.api.mpls_table_dump()
for table in api_reply:
print(f"Table [{table.mt_table.mt_table_id}] {table.mt_table.mt_name}")
api_reply2 = vpp.api.mpls_route_dump(table=table.mt_table.mt_table_id)
for route in api_reply2:
str = f" label {route.mr_route.mr_label} eos {route.mr_route.mr_eos}"
for path in route.mr_route.mr_paths:
str += format_path(path)
print(str)
```
{{< image width="50px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
Funny detail - it took me almost two years to discover `VppEnum`, which contains all of these
symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy
me a bottle of Glenmorangie - sl&aacute;inte!
The `format_path()` method here has the smarts. Depending on the _proto_ field, I print either
an IPv4 path, an IPv6 path, an internal MPLS path (for example for the reserved labels 0..15), or an
Ethernet path, which is the case in the FIB entry above that diverts incoming packets with label 8298 to be
presented as ethernet datagrams into the interface `mpls-tunnel0`. If it is an Ethernet proto, I
can use the _sw_if_index_ field to figure out which interface, and retrieve its details to find its
name.
The `format_path()` method finally adds the stack of labels to the returned string, if the _n_labels_
field is non-zero.
My program's output:
```
pim@vpp0-0:~/vpp_papi_examples$ ./03-mpls.py
VPP version is 24.02-rc0~46-ga16463610
Tunnel [17] mpls-tunnel0 ipv4 via 192.168.10.3 label 8298 100 200
Table [0] MPLS-VRF:0
label 0 eos 0 mpls
label 0 eos 1 mpls
label 1 eos 0 mpls
label 1 eos 1 mpls
label 2 eos 0 mpls
label 2 eos 1 mpls
label 8298 eos 1 ethernet to mpls-tunnel0
```
### Creating VxLAN Tunnels
Until now, all I've done is _inspect_ the dataplane, in other words I've called a bunch of APIs
that do not change state. Of course, many of VPP's API methods _change_ state as well. I'll turn to
another example API -- The VxLAN tunnel API is defined in `plugins/vxlan/vxlan.api` and it's gone
through a few iterations. The VPP community tries to keep backwards compatibility, and a simple way
of doing this is to create new versions of the methods by tagging them with suffixes such as `_v2`,
while eventually marking the older versions as deprecated by setting the `option deprecated;` field
in the definition. In this API specification I can see that we're already at version 3 of the
**Request/Reply** method in `vxlan_add_del_tunnel_v3` and version 2 of the **Dump/Detail** method in
`vxlan_tunnel_v2_dump`.
Once again, using these `*.api` definitions, finding an incantation to create a unicast VxLAN tunnel
with a given VNI, then listing the tunnels, and finally deleting the tunnel I just created, would
look like this:
```python
api_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=True, instance=100, vni=8298,
src_address="192.0.2.1", dst_address="192.0.2.254", decap_next_index=1)
if api_reply.retval == 0:
print(f"Created VXLAN tunnel with sw_if_index={api_reply.sw_if_index}")
api_reply = vpp.api.vxlan_tunnel_v2_dump()
for vxlan in api_reply:
    str = f"[{vxlan.sw_if_index}] instance {vxlan.instance} vni {vxlan.vni}"
    str += f" src {vxlan.src_address}:{vxlan.src_port} dst {vxlan.dst_address}:{vxlan.dst_port}"
print(str)
api_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=False, instance=100, vni=8298,
src_address="192.0.2.1", dst_address="192.0.2.254", decap_next_index=1)
if api_reply.retval == 0:
print(f"Deleted VXLAN tunnel with sw_if_index={api_reply.sw_if_index}")
```
Many of the APIs in VPP will have create and delete in the same method, mostly by specifying the
operation with an `is_add` argument like here. I think it's kind of nice because it makes the
creation and deletion symmetric, even though the deletion needs to specify a fair bit more than
strictly necessary: the _instance_ uniquely identifies the tunnel and should have been enough.
The output of this [[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)] sequence
(which stands for **C**reate, **R**ead, **U**pdate, **D**elete, in case you haven't come across that
acronym yet) then looks like this:
```
pim@vpp0-0:~/vpp_papi_examples$ ./04-vxlan.py
VPP version is 24.02-rc0~46-ga16463610
Created VXLAN tunnel with sw_if_index=18
[18] instance 100 vni 8298 src 192.0.2.1:4789 dst 192.0.2.254:4789
Deleted VXLAN tunnel with sw_if_index=18
```
### Listening to Events
But wait, there's more! Just one more thing, I promise. Way in the beginning of this article, I
mentioned that there is a special variant of the **Dump/Detail** pattern, and that's the **Events**
pattern. With the VPP API client, first I register a single callback function, and then I can
enable/disable events to trigger this callback.
{{< image width="80px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
One important note to this: enabling this callback will spawn a new (Python) thread so that the main
program can continue to execute. Because of this, all the standard care has to be taken to make the
program thread-aware. Make sure to pass information from the events-thread to the main-thread in a
safe way!
Let me demonstrate this powerful functionality with a program that listens on
`want_interface_events` which is defined in `interface.api`:
```
service {
rpc want_interface_events returns want_interface_events_reply
events sw_interface_event;
};
define sw_interface_event
{
u32 client_index;
u32 pid;
vl_api_interface_index_t sw_if_index;
vl_api_if_status_flags_t flags;
bool deleted;
};
```
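The `flags` field in these events is a bitmask of `vl_api_if_status_flags_t` from
`interface_types.api`. A small helper to decode it might look like this -- note the constant values
(admin-up is bit 1, link-up is bit 2) are copied by hand here rather than read from `VppEnum`:

```python
# Flag values as defined in interface_types.api; hard-coded here for the sketch.
IF_STATUS_API_FLAG_ADMIN_UP = 1
IF_STATUS_API_FLAG_LINK_UP = 2

def decode_if_flags(flags):
    names = []
    if flags & IF_STATUS_API_FLAG_ADMIN_UP:
        names.append("admin-up")
    if flags & IF_STATUS_API_FLAG_LINK_UP:
        names.append("link-up")
    return names or ["down"]

print(decode_if_flags(3))  # -> ['admin-up', 'link-up']
```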
Here's a complete program, shebang and all, that accomplishes this in a minimalistic way:
```python
#!/usr/bin/env python3
import time
from vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum
def sw_interface_event(msg):
print(msg)
def vpp_event_callback(msg_name, msg):
if msg_name == "sw_interface_event":
sw_interface_event(msg)
else:
print(f"Received unknown callback: {msg_name} => {msg}")
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("ipng-client")
vpp.register_event_callback(vpp_event_callback)
vpp.api.want_interface_events(enable_disable=True, pid=8298)
api_reply = vpp.api.show_version()
print(f"VPP version is {api_reply.version}")
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
pass
```
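Given the warning above about thread safety, one pattern that works well is to have the callback do
nothing except enqueue the message, and let the main thread consume from the queue at its leisure.
Here's a self-contained sketch of that idea, with a stand-in thread playing the role of VPP's event
thread:

```python
# The callback runs on the API client's event thread; queue.Queue.put/get are
# thread-safe, so the queue is the only object both threads touch.
import queue
import threading

events = queue.Queue()

def vpp_event_callback(msg_name, msg):
    events.put((msg_name, msg))

# Stand-in for VPP's event thread delivering two interface events:
producer = threading.Thread(target=lambda: [
    vpp_event_callback("sw_interface_event", {"sw_if_index": 1, "flags": 3}),
    vpp_event_callback("sw_interface_event", {"sw_if_index": 2, "flags": 0}),
])
producer.start()
producer.join()

# Main thread drains the queue safely:
while not events.empty():
    msg_name, msg = events.get()
    print(f"{msg_name}: sw_if_index={msg['sw_if_index']} flags={msg['flags']}")
```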
## Results
After all of this deep-diving, all that's left is for me to demonstrate the API by means of this
little asciinema [[screencast](/assets/vpp-papi/vpp_papi_clean.cast)] - I hope you enjoy it as much
as I enjoyed creating it:
{{< image src="/assets/vpp-papi/vpp_papi_clean.gif" alt="Asciinema" >}}
Note to self:
```
$ asciinema-edit quantize --range 0.18,0.8 --range 0.5,1.5 --range 1.5 \
vpp_papi.cast > clean.cast
$ Insert the ANSI colorcodes from the mac's terminal into clean.cast's header:
"theme":{"fg": "#ffffff","bg":"#000000",
"palette":"#000000:#990000:#00A600:#999900:#0000B3:#B300B3:#999900:#BFBFBF:
#666666:#F60000:#00F600:#F6F600:#0000F6:#F600F6:#00F6F6:#F6F6F6"}
$ agg --font-size 18 clean.cast clean.gif
$ gifsicle --lossy=80 -k 128 -O2 -Okeep-empty clean.gif -o vpp_papi_clean.gif
```
---
date: "2024-02-10T10:17:54Z"
title: VPP on FreeBSD - Part 1
---
# About this series
{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}}
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Over the years, folks have asked me regularly "What about BSD?" and to
my surprise, late last year I read an announcement from the _FreeBSD Foundation_
[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back
over 2023 and forward to 2024:
> ***Porting the Vector Packet Processor to FreeBSD***
>
> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
> that provides fast packet processing suitable for software-defined networking and network function
> virtualization applications. VPP aims to optimize packet processing through vectorized operations
> and parallelism, making it well-suited for high-speed networking applications. In November of this
> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network
> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
> experience, improving hardware support on arm64 platforms, and adding support for low power idle
> on Intel and arm64 hardware.
I reached out to Tom and introduced myself -- and IPng Networks -- and offered to partner. Tom knows
FreeBSD very well, and I know VPP very well. And considering lots of folks have asked me that loaded
"What about BSD?" question, I think a reasonable answer might now be: Coming up!
Tom will be porting VPP to FreeBSD, and I'll be providing a test environment with a few VMs,
physical machines with varying architectures (think single-numa, AMD64 and Intel platforms).
In this first article, let's take a look at the table stakes: installing FreeBSD 14.0-RELEASE and doing
all the little steps necessary to get VPP up and running.
## My test setup
Tom and I will be using two main test environments. The first is a set of VMs running on QEMU, which
we can do functional testing on, by configuring a bunch of VPP routers with a set of normal FreeBSD
hosts attached to them. The second environment will be a few Supermicro bare metal servers that
we'll use for performance testing, notably to compare the FreeBSD kernel routing, fancy features
like `netmap`, and of course VPP itself. I do intend to do some side-by-side comparisons between
Debian and FreeBSD when they run VPP.
{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
If you know me a little bit, you'll know that I typically forget how I did a thing, so I'm using
this article for others as well as myself in case I want to reproduce this whole thing 5 years down
the line. Oh, and if you don't know me at all, now you know my brain, pictured left, is not too
different from a leaky sieve.
### VMs: IPng Lab
I really like the virtual machine environment that the [[IPng Lab]({% post_url 2022-10-14-lab-1 %})]
provides. So my very first step is to grab a UFS-based image like [[these
ones](https://download.freebsd.org/releases/VM-IMAGES/14.0-RELEASE/amd64/Latest/)], and I prepare a
lab image. This goes roughly as follows --
1. Download the UFS qcow2 and `unxz` it.
1. Create a 10GB ZFS blockdevice `zfs create ssd-vol0/vpp-proto-freebsd-disk0 -V10G`
1. Make a copy of my existing vpp-proto-bookworm libvirt config, and edit it with new MAC addresses,
UUID and hostname (essentially just an `s/bookworm/freebsd/g`)
1. Boot the VM once using VNC, to add serial booting to `/boot/loader.conf`
1. Finally, install a bunch of stuff that I would normally use on a FreeBSD machine:
* A user account 'pim' and 'ipng', set the 'root' password
* A bunch of packages (things like vim, bash, python3, rsync)
* SSH host keys and `authorized_keys` files
* A sensible `rc.conf` that DHCPs on its first network card `vtnet0`
I notice that FreeBSD has something pretty neat in `rc.conf`, called `growfs_enable`, which will
take a look at the total disk size available in slice 4 (the one that contains the main filesystem),
and if the disk has free space beyond the end of the partition, it'll slurp it up and resize the
filesystem to fit. Reading the `/etc/rc.d/growfs` file, I see that this works for both ZFS and UFS.
A chef's kiss that I found super cool!
Next, I take a snapshot of the disk image and add it to the Lab's `zrepl` configuration, so that
this base image gets propagated to all hypervisors, the result is a nice 10GB large base install
that boots off of serial.
```
pim@hvn0-chbtl0:~$ zfs list -t all | grep vpp-proto-freebsd
ssd-vol0/vpp-proto-freebsd-disk0 13.0G 45.9G 6.14G -
ssd-vol0/vpp-proto-freebsd-disk0@20240206-release 614M - 6.07G -
ssd-vol0/vpp-proto-freebsd-disk0@20240207-release 3.95M - 6.14G -
ssd-vol0/vpp-proto-freebsd-disk0@20240207-2 0B - 6.14G -
ssd-vol0/vpp-proto-freebsd-disk0#zrepl_CURSOR_G_760881003460c452_J_source-vpp-proto - - 6.14G -
```
One note for the pedants -- the kernel that ships with Debian, for some reason I don't quite
understand, does not come with a UFS kernel module that allows mounting these filesystems
read-write. Maybe this is because there are a few different flavors of UFS out there, and the
maintainer of that kernel module is not comfortable enabling write-mode on all of them. I
don't know, but my use case isn't critical as my build will just copy a few files on the
otherwise ephemeral ZFS cloned filesystem.
So off I go, asking Summer to build me a Linux 6.1 kernel for Debian Bookworm (which is what the
hypervisors are running). For those following along at home, here's how that looked for me:
```
pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \
libncurses5-dev libelf-dev libssl-dev dwarves bison
pim@summer:/usr/src$ sudo apt install linux-source-6.1
pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz
pim@summer:/usr/src$ cd linux-source-6.1/
pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-16-amd64 .config
pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config
CONFIG_UFS_FS=m
CONFIG_UFS_FS_WRITE=y
EOF
pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig
pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg
```
Finally, I add a new LAB overlay type called `freebsd` to the Python/Jinja2 tool I built, which I
use to create and maintain the LAB hypervisors. If you're curious about this part, take a look at
the [[article]({% post_url 2022-10-14-lab-1 %})] I wrote about the environment. I reserve LAB #2
running on `hvn2.lab.ipng.ch` for the time being, as LAB #0 and #1 are in use by other projects. To
cut to the chase, here's what I type to generate the overlay and launch a LAB using the FreeBSD I
just made. There's not much in the overlay, really just some templated `rc.conf` to set the correct
hostname and mgmt IPv4/IPv6 addresses and so on.
```
pim@lab:~/src/lab$ find overlays/freebsd/ -type f
overlays/freebsd/common/home/ipng/.ssh/authorized_keys.j2
overlays/freebsd/common/etc/rc.local.j2
overlays/freebsd/common/etc/rc.conf.j2
overlays/freebsd/common/etc/resolv.conf.j2
overlays/freebsd/common/root/lab-build/perms
overlays/freebsd/common/root/.ssh/authorized_keys.j2
pim@lab:~/src/lab$ ./generate --host hvn2.lab.ipng.ch --overlay freebsd
pim@lab:~/src/lab$ export BASE=vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-freebsd-disk0@20240207-2
pim@lab:~/src/lab$ OVERLAY=freebsd LAB=2 ./create
pim@lab:~/src/lab$ LAB=2 ./command start
```
After rebooting the hypervisors with their new UFS2-write-capable kernel, I can finish the job and
create the lab VMs. The `create` call above first makes a ZFS _clone_ of the base image, then mounts
it, rsyncs the generated overlay files over it, then creates a ZFS _snapshot_ called `@pristine`,
before booting up the seven virtual machines that comprise this spiffy new FreeBSD lab:
{{< image src="/assets/freebsd-vpp/LAB v2.svg" alt="Lab Setup" >}}
I decide to park the LAB for now, as that beautiful daisy-chain of `vpp2-0` - `vpp2-3` routers will
first need a working VPP install, which I don't quite have yet.
### Bare Metal
Next, I take three spare Supermicro SYS-5018D-FN8T, which have the following specs:
* Full IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port.
* A 4-core, 8-thread Xeon D1518 CPU which runs at 35W TDP
* Two independent Intel i210 NICs (Gigabit)
* A Quad Intel i350 NIC (Gigabit)
* Two Intel X552 (TenGigabitEthernet)
* Two Intel X710-XXV (TwentyFiveGigabitEthernet) ports in the PCIe v3.0 x8 slot
* m.SATA 120G boot SSD
* 2x16GB of ECC RAM
These were still arranged in a test network from when Adrian and I worked on the [[VPP MPLS]({%
post_url 2023-05-07-vpp-mpls-1 %})] project together, and back then I called the three machines
`France`, `Belgium` and `Netherlands`. I decide to reuse that, and save myself some recabling.
Using IPMI, I install the `France` server with FreeBSD, while the other two, for now, are still
running Debian. This can be useful for (a) side by side comparison tests and (b) to be able to
quickly run some T-Rex loadtests.
{{< image src="/assets/freebsd-vpp/freebsd-bootscreen.png" alt="FreeBSD" >}}
I have to admit - I **love** Supermicro's IPMI implementation. Being able to plop in an ISO over
Samba, boot it on VGA, hop into the BIOS to set or change things, and then completely reinstall the
machine while hanging out on the couch drinking tea, is absolutely fabulous.
## Starting Point
I use the base image I described above to clone a beefy VM for building and development purposes. I
give that machine 32GB of RAM and 24 cores on one of IPng's production hypervisors. I spent some
time with Tom this week to go over a few details about the build, and he patiently described where
he's at with the porting. It's not done yet, but he has good news: it does compile cleanly on his
machine, so there is hope for me yet! He has prepared a GitHub repository with all of the changes
staged - and he will be sequencing them out one by one to merge upstream. In case you want to follow
along with his work, take a look at this [[Gerrit
search](https://gerrit.fd.io/r/q/repo:vpp+owner:thj%2540freebsd.org)].
First, I need to go build a whole bunch of stuff. Here's a recap --
1. Download ports and kernel source
1. Build and install a GENERIC kernel
1. Build DPDK including its FreeBSD kernel modules `contigmem` and `nic_uio`
1. Build netmap `bridge` utility
1. Build VPP :)
To explain a little bit: Linux has _hugepages_ which are 2MB or 1GB memory pages. These come with a
significant performance benefit, mostly because the CPU will have a table called the _Translation
Lookaside Buffer_ or [[TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)] which keeps
a mapping between virtual and physical memory pages. If there is too much memory allocated to a
process, this TLB table thrashes which comes at a performance penalty. When allocating not the
standard 4kB pages, but larger 2MB or 1GB ones, this does not happen. For FreeBSD, the DPDK library
provides an equivalent kernel module, which is called `contigmem`.
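A quick back-of-the-envelope shows why page size matters here. Taking an illustrative TLB of 1536
entries (real sizes differ per CPU model and TLB level), the amount of memory the TLB can map
without missing is simply entries times page size:

```python
# TLB reach = entries * page size. The entry count of 1536 is illustrative
# only; consult your CPU's documentation for real numbers.
entries = 1536
for name, size in [("4kB", 4096), ("2MB", 2 << 20), ("1GB", 1 << 30)]:
    reach = entries * size
    print(f"{name:>4} pages: {reach / (1 << 30):10.2f} GiB of TLB reach")
```

With 4kB pages the TLB covers only a few MB, so a dataplane touching gigabytes of packet buffers
thrashes it constantly; with 2MB or 1GB pages the working set fits comfortably.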
Many (but not all!) DPDK poll mode drivers will remove the kernel network card driver and rebind the
network card to a _Userspace IO_ or _UIO_ driver. DPDK also ships one of these for FreeBSD, called
`nic_uio`. So my first three steps are compiling all of these things, including a standard DPDK
install from ports.
### Build: FreeBSD + DPDK
Building things on FreeBSD is all very well documented in the [[FreeBSD
Handbook](https://docs.freebsd.org/en/books/developers-handbook/kernelbuild/)].
In order to avoid filling up the UFS boot disk, I snuck in another SAS-12 SSD to get a bit
faster builds, and I mount `/usr/src` and `/usr/obj` on it.
Here's a recap of what I ended up doing to build a fresh GENERIC kernel and the `DPDK` port:
```
[pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/src ssd-vol0/src
[pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/obj ssd-vol0/obj
[pim@freebsd-builder ~]$ sudo git clone --branch stable/14 https://git.FreeBSD.org/src.git /usr/src
[pim@freebsd-builder /usr/src]$ sudo make buildkernel KERNCONF=GENERIC
[pim@freebsd-builder /usr/src]$ sudo make installkernel KERNCONF=GENERIC
[pim@freebsd-builder ~]$ sudo git clone https://git.FreeBSD.org/ports.git /usr/ports
[pim@freebsd-builder /usr/ports/net/dpdk ]$ sudo make install
```
I patiently answer a bunch of questions (all of them just with the default) when the build process
asks me what I want. DPDK is a significant project, and it pulls in lots of dependencies to build as
well. After what feels like an eternity, the builds are complete, and I have a kernel together with
kernel modules, as well as a bunch of handy DPDK helper utilities (like `dpdk-testpmd`) installed.
Just to set expectations -- the build took about an hour for me from start to finish (on a 32GB machine
with 24 vCPUs), so hunker down if you go this route.
**NOTE**: I wanted to see what I was being asked in this build process, but since I ended up answering
everything with the default, you can feel free to add `BATCH=yes` to the make of DPDK (and see the man
page of dpdk(7) for details).
### Build: contigmem and nic_uio
Using a few `sysctl` calls, I can configure four buffers of 1GB each, which will serve as my
equivalent _hugepages_ from Linux, and I add the following to `/boot/loader.conf`, so that these
contiguous regions are reserved early in the boot cycle, when memory is not yet fragmented:
```
hw.contigmem.num_buffers=4
hw.contigmem.buffer_size=1073741824
contigmem_load="YES"
```
To figure out which network devices to rebind to the _UIO_ driver, I can inspect the PCI bus with the
`pciconf` utility:
```
[pim@freebsd-builder ~]$ pciconf -vl | less
...
virtio_pci0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100
vendor = 'Red Hat, Inc.'
device = 'Virtio 1.0 network device'
class = network
subclass = ethernet
virtio_pci1@pci0:1:0:1: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100
vendor = 'Red Hat, Inc.'
device = 'Virtio 1.0 network device'
class = network
subclass = ethernet
virtio_pci0@pci0:1:0:2: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100
vendor = 'Red Hat, Inc.'
device = 'Virtio 1.0 network device'
class = network
subclass = ethernet
virtio_pci1@pci0:1:0:3: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100
vendor = 'Red Hat, Inc.'
device = 'Virtio 1.0 network device'
class = network
subclass = ethernet
```
My virtio based network devices are on PCI location `1:0:0` -- `1:0:3` and I decide to take away the
last two, which makes my final loader configuration for the kernel:
```
[pim@freebsd-builder ~]$ cat /boot/loader.conf
kern.geom.label.disk_ident.enable=0
zfs_load=YES
boot_multicons=YES
boot_serial=YES
comconsole_speed=115200
console="comconsole,vidconsole"
hw.contigmem.num_buffers=4
hw.contigmem.buffer_size=1073741824
contigmem_load="YES"
nic_uio_load="YES"
hw.nic_uio.bdfs="1:0:2,1:0:3"
```
## Build: Results
Now that all of this is done, the machine boots with these drivers loaded, and I can see only my
first two network devices (`vtnet0` and `vtnet1`), while the other two are gone. This is good news,
because that means they are now under control of the DPDK `nic_uio` kernel driver, woohoo!
```
[pim@freebsd-builder ~]$ kldstat
Id Refs Address Size Name
1 28 0xffffffff80200000 1d36230 kernel
2 1 0xffffffff81f37000 4258 nic_uio.ko
3 1 0xffffffff81f3c000 5d5618 zfs.ko
4 1 0xffffffff82513000 5378 contigmem.ko
5 1 0xffffffff82c18000 3250 ichsmb.ko
6 1 0xffffffff82c1c000 2178 smbus.ko
7 1 0xffffffff82c1f000 430c virtio_console.ko
8 1 0xffffffff82c24000 22a8 virtio_random.ko
```
## Build: VPP
Tom has prepared a branch on his GitHub account, which poses a few small issues with the build.
Notably, we have to use a few GNU tools like `gmake`. But overall, I find the build very
straightforward - it looks like this:
```
[pim@freebsd-builder ~]$ sudo pkg install py39-ply git gmake gsed cmake libepoll-shim gdb python3 ninja
[pim@freebsd-builder ~/src]$ git clone git@github.com:adventureloop/vpp.git
[pim@freebsd-builder ~/src/vpp]$ git checkout freebsd-vpp
[pim@freebsd-builder ~/src/vpp]$ gmake install-dep
[pim@freebsd-builder ~/src/vpp]$ gmake build
```
# Results
Now, taking into account that not everything works (for example there isn't a packaging yet, let
alone something as fancy as a port), and that there's a bit of manual tinkering going on, let me
show you at least the absolute gem that is this screenshot:
{{< image src="/assets/freebsd-vpp/freebsd-vpp.png" alt="VPP :-)" >}}
The (debug build) VPP instance started, the DPDK plugin loaded, and it found the two devices that
were bound by the newly installed `nic_uio` driver. Setting an IPv4 address on one of these
interfaces works, and I can ping another machine on the LAN connected to Gi10/0/2, which I find
dope.
Hello, World!
# What's next ?
There's a lot of ground to cover with this port. While Tom munches away at the Gerrits he has
stacked up, I'm going to start kicking the tires on the FreeBSD machines. In this article, I showed
the table stakes: a FreeBSD lab on the hypervisors, a build machine with DPDK, kernel and VPP in a
somewhat working state (with two NICs in VirtIO), and a Supermicro bare metal machine installed to
do the same.
In a future set of articles in this series, I will:
* Do a comparative loadtest between FreeBSD kernel, Netmap, VPP+Netmap, and VPP+DPDK
* Take a look at how FreeBSD stacks up against Debian on the same machine
* Do a bit of functional testing, to ensure dataplane functionality is in place
A few things will need some attention:
* Some Linux details have leaked, for example `show cpu` and `show pci` in VPP
* Linux Control Plane uses TAP devices which Tom has mentioned may need some work
* Similarly, Linux Control Plane netlink handling may or may not work as expected in FreeBSD
* Build and packaging, obviously there is no `make pkg-deb`
---
date: "2024-02-17T12:17:54Z"
title: VPP on FreeBSD - Part 2
---
# About this series
{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}}
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Over the years, folks have asked me regularly "What about BSD?" and to
my surprise, late last year I read an announcement from the _FreeBSD Foundation_
[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back
over 2023 and forward to 2024:
> ***Porting the Vector Packet Processor to FreeBSD***
>
> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
> that provides fast packet processing suitable for software-defined networking and network function
> virtualization applications. VPP aims to optimize packet processing through vectorized operations
> and parallelism, making it well-suited for high-speed networking applications. In November of this
> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network
> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
> experience, improving hardware support on arm64 platforms, and adding support for low power idle
> on Intel and arm64 hardware.
In my first [[article]({% post_url 2024-02-10-vpp-freebsd-1 %})], I wrote a sort of a _hello world_
by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
performance numbers.
## Comparing implementations
FreeBSD has an extensive network stack, including regular _kernel_ based functionality such as
routing, filtering and bridging, a faster _netmap_ based datapath, including some userspace
utilities like a _netmap_ bridge, and of course completely userspace based dataplanes, such as the
VPP project that I'm working on here. Last week, I learned that VPP has a _netmap_ driver, and from
previous travels I am already quite familiar with its _DPDK_ based forwarding. I decide to do a
baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
[[article]({% post_url 2024-02-10-vpp-freebsd-1 %})] for details on the setup.
The loadtests will use a common set of different configurations, using Cisco T-Rex's default
benchmark profile called `bench.py`:
1. **var2-1514b**: Large Packets, multiple flows with modulating source and destination IPv4
addresses, often called an 'iperf test', with packets of 1514 bytes.
1. **var2-imix**: Mixed Packets, multiple flows, often called an 'imix test', which includes a
bunch of 64b, 390b and 1514b packets.
1. **var2-64b**: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive
queues and kernel or application threads.
1. **64b**: Small Packets, but now single flow, often called 'linerate test', with a packet size
of 64 bytes, limiting to one receive queue.
Each of these four loadtests can run either unidirectionally (port0 -> port1) or bidirectionally
(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle
on and get underway.
### FreeBSD 14: Kernel Bridge
The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD `igb(4)` driver),
a dual-port Intel X552 (10Gbps SFP+, using the `ix(4)` driver), and a dual-port Intel X710-XXV
(25Gbps SFP28, using the `ixl(4)` driver). I decide to live it up a little, and choose the 25G ports
for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU
will struggle a little bit at very high packet rates. No pain, no gain, _amirite_?
I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC
kernel that has support for the DPDK modules I'll need later. For my first loadtest, I create a
kernel based bridge as follows, just tying the two 25G interfaces together:
```
[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7
[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
options=0
ether 58:9c:fc:10:6c:2e
id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 4 priority 128 path cost 800
member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 3 priority 128 path cost 800
groups: bridge
nd6 options=9<PERFORMNUD,IFDISABLED>
```
One thing that I quickly realize, is that FreeBSD, when using hyperthreading, does have 8 threads
available, but only 4 of them participate in forwarding. When I put the machine under load, I see a
curious 399% spent in _kernel_ while I see 402% in _idle_:
{{< image src="/assets/freebsd-vpp/top-kernel-bridge.png" alt="FreeBSD top" >}}
When I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU
participates (100% in _kernel_ and 700% in _idle_) and if I perform a single-flow bidirectional
loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in _kernel_
and 600% in _idle_).
While the math checks out, the performance is a little bit less impressive:
| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.02Mpps | 24.77Gbps | 99% |
| vm=var2,size=imix | Unidirectional | 3.48Mpps | 10.23Gbps | 43% |
| vm=var2,size=64 | Unidirectional | 3.61Mpps | 2.43Gbps | 9.7% |
| size=64 | Unidirectional | 1.22Mpps | 0.82Gbps | 3.2% |
| vm=var2,size=1514 | Bidirectional | 3.77Mpps | 46.31Gbps | 93% |
| vm=var2,size=imix | Bidirectional | 3.81Mpps | 11.22Gbps | 24% |
| vm=var2,size=64 | Bidirectional | 4.02Mpps | 2.69Gbps | 5.4% |
| size=64 | Bidirectional | 2.29Mpps | 1.54Gbps | 3.1% |
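As an aside: the _Line Rate_ column can be reproduced from the packet rate alone. Each Ethernet
frame occupies an extra 20 bytes on the wire (7 bytes preamble, 1 byte start-of-frame delimiter, 12
bytes inter-frame gap), so a 25Gbps port tops out at `25e9 / ((size + 20) * 8)` packets per second.
A small sketch checking a few of the rows above:

```python
# Line-rate % = offered pps / theoretical max pps at that frame size.
# On-wire overhead per frame: 7B preamble + 1B SFD + 12B inter-frame gap = 20B.
def line_rate_pct(pps, size, ports=1, link_bps=25e9):
    max_pps = ports * link_bps / ((size + 20) * 8)
    return 100 * pps / max_pps

print(f"1514b uni:  {line_rate_pct(2.02e6, 1514):.0f}%")           # ~99%
print(f"  64b uni:  {line_rate_pct(3.61e6, 64):.1f}%")             # ~9.7%
print(f"1514b bidi: {line_rate_pct(3.77e6, 1514, ports=2):.0f}%")  # ~93%
```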
***Conclusion***: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU
thread, and I can use only four of them. FreeBSD is happy to forward big packets, and I can
reasonably reach 2x25Gbps but once I start ramping up the packets/sec by lowering the packet size,
things very quickly deteriorate.
### FreeBSD 14: netmap Bridge
Tom pointed out a tool in the source tree, called the _netmap bridge_ originally written by Luigi
Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub
repository [[ref](https://github.com/luigirizzo/netmap/blob/master/apps/bridge/bridge.c)].
What is _netmap_ anyway? It's a framework for extremely fast and efficient packet I/O for userspace
and kernel clients, and for Virtual Machines. It runs on FreeBSD, Linux and some versions of
Windows. As an aside, my buddy Pavel from FastNetMon pointed out a blogpost from 2015 in which
Cloudflare folks described a way to do DDoS mitigation on Linux using traffic classification to
program the network cards to move certain offensive traffic to a dedicated hardware queue, and
service that queue from a _netmap_ client. If you're curious (I certainly was!), you might take a
look at that cool write-up
[[here](https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap)].
I compile the code and put it to work, and the man-page tells me that I need to fiddle with the
interfaces a bit. They need to be:
* set to _promiscuous_ mode, which makes sense as they have to receive ethernet frames sent to
  MAC addresses other than their own;
* stripped of any hardware offloading, notably `-rxcsum -txcsum -tso4 -tso6 -lro`;
* and my user needs write permission on `/dev/netmap` to bind the interfaces from userspace.
```
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
```
{{< image width="80px" float="left" src="/assets/freebsd-vpp/warning.png" alt="Warning" >}}
I start my first loadtest, which fails almost immediately. It's an interesting behavior pattern
which I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which
is a remarkably straightforward program, I restart the bridge utility, and traffic passes again, but
only for a little while. Whoops!
I took a [[screencast](/assets/freebsd-vpp/netmap_bridge.cast)] in case any kind soul on freebsd-net
wants to take a closer look at this:
{{< image src="/assets/freebsd-vpp/netmap_bridge.gif" alt="FreeBSD netmap Bridge" >}}
I start a bit of trial and error in which I conclude that if I send **a lot** of traffic (like 10Mpps),
forwarding is fine; but if I send **a little** traffic (like 1kpps), at some point forwarding stops
altogether. So while it's not great, this does allow me to measure the total throughput just by
sending a lot of traffic, say 30Mpps, and seeing what amount comes out the other side.
Here I go, and I'm having fun:
| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.04Mpps | 24.72Gbps | 100% |
| vm=var2,size=imix | Unidirectional | 8.16Mpps | 23.76Gbps | 100% |
| vm=var2,size=64 | Unidirectional | 10.83Mpps | 5.55Gbps | 29% |
| size=64 | Unidirectional | 11.42Mpps | 5.83Gbps | 31% |
| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.27Gbps | 96% |
| vm=var2,size=imix | Bidirectional | 11.31Mpps | 32.74Gbps | 77% |
| vm=var2,size=64 | Bidirectional | 11.39Mpps | 5.83Gbps | 15% |
| size=64 | Bidirectional | 11.57Mpps | 5.93Gbps | 16% |
***Conclusion***: FreeBSD's _netmap_ implementation is also bound by packets/sec, and in this
setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that
single flow versus multiple flows doesn't seem to matter much; in fact, the bidirectional 64b
single-flow loadtest was the most favorable at 11.57Mpps, which is _an order of magnitude_ better
than using just the kernel (which clocked in at 1.2Mpps).
### FreeBSD 14: VPP with netmap
It's good to have a baseline on this machine on how the FreeBSD kernel itself performs. But of
course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch
that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my
[[first article]({% post_url 2024-02-10-vpp-freebsd-1 %})] so I'll just go straight to the
configuration parts:
```
DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 25622
rx bytes 1537320
tx packets 25437
tx bytes 1526220
netmap-ixl1 2 up 9000/0/0/0 rx packets 25437
rx bytes 1526220
tx packets 25622
tx bytes 1537320
```
At this point I can pretty much rule out that the _netmap_ `bridge.c` is the issue, because a
few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester
receives no more packets, even though T-Rex is still sending it. However, about a minute later
I can _also_ see the RX **and** TX counters continue to increase in the VPP dataplane:
```
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 515843
rx bytes 30950580
tx packets 515657
tx bytes 30939420
netmap-ixl1 2 up 9000/0/0/0 rx packets 515657
rx bytes 30939420
tx packets 515843
tx bytes 30950580
```
.. and I can see that every packet that VPP received is accounted for: interface `ixl0` has received
515843 packets, and `ixl1` claims to have transmitted _exactly_ that amount of packets. So I think
perhaps they are getting lost somewhere on egress, between the kernel and the Intel XXV710 network
card.
However, in contrast to the previous case, I cannot sustain any reasonable amount of traffic: be it
1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after
introducing the load. Restarting VPP makes it forward traffic again for a few seconds, only to end
up in the same upset state. I don't learn much.
***Conclusion***: This setup with VPP using _netmap_ does not yield results, for the moment. I have a
suspicion that whatever the cause is of the _netmap_ bridge in the previous test, is likely also the
culprit for this test.
### FreeBSD 14: VPP with DPDK
But not all is lost - I have one test left, and judging by what I learned last week when bringing up
the first test environment, this one is going to be a fair bit better. In my previous loadtests, the
network interfaces were on their usual kernel driver (`ixl(4)` in the case of the Intel XXV710
interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a specific DPDK
driver called `nic_uio(4)` which stands for _Network Interface Card Userspace Input/Output_:
```
[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
```
After I reboot, the network interfaces are gone from the output of `ifconfig(8)`, which is good. I
start up VPP with a minimal config file [[ref](/assets/freebsd-vpp/startup.conf)], which defines
three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
there would be one more TX queue. The explanation is that in VPP, there is one (1) _main_ thread and
zero or more _worker_ threads. If the _main_ thread wants to send traffic (for example, in a plugin
like _LLDP_ which sends periodic announcements), it would be most efficient to use a transmit queue
specific to that _main_ thread. Any return traffic will be picked up by the _DPDK Process_ on worker
threads (as _main_ does not have one of these). That's why the general rule is num(TX) = num(RX)+1.
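The rule can be sketched as a trivial helper; with the three workers used here, it comes out to 3 RX and 4 TX queues:

```python
# Sketch of the queue-count rule: every worker polls its own RX queue and
# owns a TX queue; the main thread gets one extra TX queue but no RX queue.
def dpdk_queues(num_workers):
    rx = num_workers        # workers run the Poll Mode Driver; main does not
    tx = num_workers + 1    # one per worker, plus one for the main thread
    return rx, tx

print(dpdk_queues(3))  # (3, 4) -- matching the startup.conf used here
```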
```
[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
TwentyFiveGigabitEthernet6/0/0 1 up 9000/0/0/0 rx packets 11615035382
rx bytes 1785998048960
tx packets 700076496
tx bytes 161043604594
TwentyFiveGigabitEthernet6/0/1 2 up 9000/0/0/0 rx packets 700076542
rx bytes 161043674054
tx packets 11615035440
tx bytes 1785998136540
local0 0 down 0/0/0/0
```
And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great
relief, sending either 1kpps or 1Mpps "just works". I can run my loadtest as per normal, first with
1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
course, both unidirectionally and bidirectionally.
I take a look at the system load while the loadtests are running:
{{< image src="/assets/freebsd-vpp/top-vpp-dpdk.png" alt="FreeBSD top" >}}
It is fully expected that the VPP process is spinning 300% +epsilon of CPU time. This is because it
has started three _worker_ threads, and these are executing the DPDK _Poll Mode Driver_ which is
essentially a tight loop that asks the network cards for work, and if there are any packets
arriving, executes on that work. As such, each _worker_ thread is always burning 100% of its
assigned CPU.
That said, I can take a look at finer grained statistics in the dataplane itself:
```
vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
ip4-full-reassembly-expire-wal any wait 0 0 18 2.39e3 0.00
ip6-full-reassembly-expire-wal any wait 0 0 18 3.08e3 0.00
unix-cli-process-0 active 0 0 9 7.62e4 0.00
unix-epoll-input polling 13066 0 0 1.50e5 0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 2.20e1 12.63
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 9.54e1 12.63
dpdk-input polling 1531252 5047800 0 1.45e2 3.29
ethernet-input active 399663 5047800 0 3.97e1 12.63
l2-input active 399663 5047800 0 2.93e1 12.63
l2-output active 399663 5047800 0 2.53e1 12.63
unix-epoll-input polling 1494 0 0 3.09e2 0.00
(et cetera)
```
I showed only one _worker_ thread's output, but there are actually three _worker_ threads, and they
are all doing similar work, each picking up roughly a third of the traffic as it is distributed over
the three RX queues in the network card.
While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the _main_ thread)
is doing essentially ~nothing. It is polling a set of unix sockets in the node called
`unix-epoll-input`, but other than that, _main_ doesn't have much on its plate. Thread 1 however is
a _worker_ thread, and I can see that it is busy doing work:
* `dpdk-input`: it's polling the NIC for work, it has been called 1.53M times, and in total it has
handled just over 5.04M _vectors_ (which are packets). So I can derive, that each time the _Poll
Mode Driver_ gives work, on average there are 3.29 _vectors_ (packets), and each packet is
taking about 145 CPU clocks.
* `ethernet-input`: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as
I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it
should handle the packets in the L2 forwarding path.
* `l2-input` is called with the (list of N) ethernet frames, which all get cross connected to the
output interface, in this case Tf6/0/1.
* `l2-output` prepares the ethernet frames for output into their egress interface.
* `TwentyFiveGigabitEthernet6/0/1-output` (**Note**: the name is truncated) If this were to have
been L3 traffic, this would be the place where the destination MAC address is inserted into the
ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames
through to the final egress node in DPDK.
* `TwentyFiveGigabitEthernet6/0/1-tx` (**Note**: the name is truncated) hands them to the DPDK
driver for marshalling on the wire.
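To make the derivation concrete, here is the arithmetic on the raw counters from the `show run` output above (thread 1's `dpdk-input` node):

```python
# Arithmetic on the raw `show run` counters for thread 1's dpdk-input node.
calls, vectors = 1_531_252, 5_047_800
vectors_per_call = vectors / calls       # ~3.3 packets handed over per poll
clocks_per_vector = 1.45e2               # as reported in the Clocks column
pps = 5.6294e6                           # "vector rates in" for this thread
busy = pps * clocks_per_vector / 2.2e9   # fraction of a 2.2GHz core in this node
print(round(vectors_per_call, 2), round(busy, 2))  # ~3.3 packets/call, ~0.37 of the core
```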
Halfway through, I see that there's an issue with the distribution of ingress traffic over the
three workers, maybe you can spot it too:
```
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.34e1 30.93
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.37e2 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.45e1 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.34e2 30.93
dpdk-input polling 7128012 413802792 0 8.77e1 58.05
ethernet-input active 13378125 413802792 0 2.77e1 30.93
l2-input active 6809002 413802792 0 1.81e1 60.77
l2-output active 6809002 413802792 0 1.68e1 60.77
unix-epoll-input polling 6954 0 0 6.61e2 0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 1.27e1 256.00
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 2.64e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 1.39e1 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 2.74e2 256.00
dpdk-input polling 456112 233529344 0 1.41e2 512.00
ethernet-input active 912224 233529344 0 5.71e1 256.00
l2-input active 912224 233529344 0 3.66e1 256.00
l2-output active 912224 233529344 0 1.70e1 256.00
unix-epoll-input polling 445 0 0 9.59e2 0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 8.94e0 256.00
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 2.81e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 9.54e0 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 2.72e2 256.00
dpdk-input polling 456113 233529856 0 1.61e2 512.00
ethernet-input active 912226 233529856 0 4.50e1 256.00
l2-input active 912226 233529856 0 2.93e1 256.00
l2-output active 912226 233529856 0 1.23e1 256.00
unix-epoll-input polling 445 0 0 1.03e3 0.00
```
Thread 1 (`vpp_wk_0`) is handling 7.29Mpps and is moderately loaded, while threads 2 and 3 are each
handling 4.11Mpps and are completely pegged. That said, the relative amount of CPU clocks they are
spending per packet is reasonably similar, but they don't quite add up:
* Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this
number by adding up all of the values in the _Clocks_ column, except for the `unix-epoll-input`
node. But that's somewhat strange, because this Xeon D1518 clocks at 2.2GHz -- and yet 7.29M *
449 is 3.27GHz. My experience (in Linux) is that these numbers actually line up quite well.
* Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of
makes sense as the cycles/packet is roughly double that of thread 1, and the packet/sec is
roughly half ... and the total of 4.12M * 816 is 3.36GHz.
* I see similar values for thread 3: 4.12Mpps and also 819 CPU cycles per packet, which
  amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
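The accounting above is just a sum over the _Clocks_ column (excluding `unix-epoll-input`), multiplied by the thread's packet rate. A small sketch reproducing it from the `show run` numbers:

```python
# Summing the Clocks column per thread (all nodes except unix-epoll-input),
# then multiplying by that thread's packet rate, from the `show run` output.
thread_pps    = {1: 7.2982e6, 2: 4.1188e6}
thread_clocks = {1: [1.34e1, 1.37e2, 1.45e1, 1.34e2, 8.77e1, 2.77e1, 1.81e1, 1.68e1],
                 2: [1.27e1, 2.64e2, 1.39e1, 2.74e2, 1.41e2, 5.71e1, 3.66e1, 1.70e1]}
for t in (1, 2):
    per_pkt = sum(thread_clocks[t])          # thread 1: ~449, thread 2: ~816
    ghz = thread_pps[t] * per_pkt / 1e9      # self-reported GHz-worth of work
    print(t, round(per_pkt), round(ghz, 2))
```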
When I look at the thread to CPU placement, I get another surprise:
```
vpp# show threads
ID Name Type LWP Sched Policy (Priority) lcore Core Socket State
0 vpp_main 100346 (nil) (n/a) 0 42949674294967
1 vpp_wk_0 workers 100473 (nil) (n/a) 1 42949674294967
2 vpp_wk_1 workers 100474 (nil) (n/a) 2 42949674294967
3 vpp_wk_2 workers 100475 (nil) (n/a) 3 42949674294967
vpp# show cpu
Model name: Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags: sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
rdseed aes invariant_tsc
Base frequency: 2.19 GHz
```
The numbers in `show threads` are all messed up, and I don't quite know what to make of them yet. I
think the thread pool management implementation, which is perhaps overly specific to Linux, is
throwing FreeBSD off a bit. Some profiling could be useful, so I make a note to discuss this with
Tom or the freebsd-net mailing list, who will know a fair bit more about this type of thing on
FreeBSD than I do.
Anyway, functionally this works. Performance-wise, I have some questions :-) I let all eight
loadtests complete, and without further ado, here are the results:
| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
| ---- | --------- | ----------- | -------- | --------- |
| vm=var2,size=1514 | Unidirectional | 2.01Mpps | 24.45Gbps | 99% |
| vm=var2,size=imix | Unidirectional | 8.07Mpps | 23.42Gbps | 99% |
| vm=var2,size=64 | Unidirectional | 23.93Mpps | 12.25Gbps | 64% |
| size=64 | Unidirectional | 12.80Mpps | 6.56Gbps | 34% |
| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.35Gbps | 86% |
| vm=var2,size=imix | Bidirectional | 13.38Mpps | 38.81Gbps | 82% |
| vm=var2,size=64 | Bidirectional | 15.56Mpps | 7.97Gbps | 21% |
| size=64 | Bidirectional | 20.96Mpps | 10.73Gbps | 28% |
***Conclusion***: I have to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only
being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow
loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would imagine that
three CPU threads would perform at 38.4Mpps or thereabouts, but I'm seeing only 23.9Mpps and some
unexplained variance in per-thread performance.
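The scaling gap can be made explicit with a bit of arithmetic:

```python
# If one worker forwards 12.8Mpps, perfect linear scaling over three workers
# would predict ~38.4Mpps; the measured multi-flow result falls well short.
single_worker = 12.80            # Mpps, unidirectional 64b single flow
ideal = 3 * single_worker        # 38.4 Mpps under perfect linear scaling
observed = 23.93                 # Mpps, unidirectional 64b multi-flow
print(round(ideal, 1), round(100 * observed / ideal))  # 38.4, ~62% achieved
```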
## Results
I learned a lot! Some highlights:
1. The _netmap_ implementation is not playing ball for the moment: forwarding stops consistently,
   both with the `bridge.c` utility and with the VPP plugin.
1. It is clear, though, that _netmap_ is a fair bit faster (11.4Mpps) than _kernel forwarding_,
   which came in at roughly 1.2Mpps per CPU thread.
1. DPDK performs quite well on FreeBSD, I manage to see a throughput of 20.96Mpps which is almost
twice the throughput of _netmap_, which is cool but I can't quite explain the stark variance
in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
Perhaps an equivalent of `isolcpus` in the Linux kernel would help?
For the curious, I've bundled up a few files that describe the machine and its setup:
[[dmesg](/assets/freebsd-vpp/france-dmesg.txt)]
[[pciconf](/assets/freebsd-vpp/france-pciconf.txt)]
[[loader.conf](/assets/freebsd-vpp/france-loader.conf)]
[[VPP startup.conf](/assets/freebsd-vpp/france-startup.conf)]
---
date: "2024-03-06T20:17:54Z"
title: VPP with Babel - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Thanks to the [[Linux ControlPlane]({% post_url 2021-08-12-vpp-1 %})]
plugin, higher level control plane software becomes available, that is to say: things like BGP,
OSPF, LDP, VRRP and so on become quite natural for VPP.
IPng Networks is a small service provider that has built a network based entirely on open source:
[[Debian]({% post_url 2023-12-17-defra0-debian %})] servers with widely available Intel and Mellanox
10G/25G/100G network cards, paired with [[VPP](https://fd.io/)] for the dataplane, and
[[Bird2](https://bird.nic.cz/)] for the controlplane.
As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at
which an initial allocation was a /19, and subsequent allocations usually a /20 based on
justification. Then it watered down to a /22 for new _Local Internet Registries_, then that became a
/24 for new _LIRs_, and ultimately we ran out. What was once a plentiful resource, has now become a
very constrained resource.
In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring
one of the newer routing protocols: Babel.
## 🙁 A sad waste
I have to go back to something very fundamental about routing. When RouterA holds a routing table,
it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a
packet, it'll look up the destination address, and then forward the packet on to RouterB which is
the next router in the path towards the destination:
1. RouterA does a route lookup in its routing table. For destination `192.0.2.1`, the covering
prefix is `192.0.2.0/24` and it might find that it can reach it via IPv4 next hop `100.64.0.1`.
1. RouterA then does another lookup in its routing table, to figure out how can it reach
`100.64.0.1`. It may find that this address is directly connected, say to interface `eth0`, on
which RouterA is `100.64.0.2/30`.
1. Assuming that `eth0` is an ethernet device, which the vast majority of interfaces are, then
RouterA can look up the link-layer address for that IPv4 address `100.64.0.1`, by using ARP.
1. The ARP request asks, quite literally `who-has 100.64.0.1?` using a broadcast message on
`eth0`, to which the other RouterB will answer `100.64.0.1 is-at 90:e2:ba:3f:ca:d5`.
1. Now that RouterA knows that, it can forward along the IP packet out on its `eth0` device and
towards `90:e2:ba:3f:ca:d5`. Huzzah.
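The five steps can be sketched as a toy lookup in Python; the FIB and ARP tables hold the hypothetical values from the example above, not output of any real router:

```python
import ipaddress

# Toy model of the five steps: a FIB with a via-route and a connected route,
# plus an ARP cache learned from the who-has broadcast.
fib = {ipaddress.ip_network("192.0.2.0/24"):  ("100.64.0.1", None),  # via RouterB
       ipaddress.ip_network("100.64.0.0/30"): (None, "eth0")}        # connected
arp = {"100.64.0.1": "90:e2:ba:3f:ca:d5"}

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    match = max((n for n in fib if addr in n), key=lambda n: n.prefixlen)
    return fib[match]  # longest-prefix match

nexthop, _ = lookup("192.0.2.1")   # step 1: via 100.64.0.1
_, iface = lookup(nexthop)         # step 2: 100.64.0.1 is connected on eth0
print(iface, arp[nexthop])         # steps 3-5: eth0 90:e2:ba:3f:ca:d5
```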
## 🥰 A clever trick
I can't help but notice that the only purpose of having the `100.64.0.0/30` transit network between
these two routers is to:
1. provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses,
using ARP resolution.
1. provide a means for the routers to send ICMP messages: for example in a traceroute, each hop
along the way will respond with a TTL exceeded message. And I do like traceroutes!
Let me discuss these two purposes in more detail:
### 1. IPv4 ARP, n&eacute;e IPv6 NDP
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
One really neat trick is simply replacing ARP resolution by something that can resolve the
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's
called _Neighbor Discovery Protocol_ in which a router can determine the link-layer address of a
neighbor, or to verify that a neighbor is still reachable via a cached link-layer address. This uses
ICMPv6 to send out a query with the _Neighbor Solicitation_, which is followed by a response in the
form of a _Neighbor Advertisement_.
Why am I talking about IPv6 neighbor discovery when I'm explaining IPv4 forwarding, you may be
wondering? Well, because of this neat trick that the IPv4 prefix brokers don't want you to know:
```
pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1
pim@vpp0-0:~$ ip -br a show e1
e1 UP fe80::5054:ff:fef0:1101/64
pim@vpp0-0:~$ ip ro get 192.0.2.0
192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110
fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE
pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0
tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98:
(tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64
```
While it looks counter-intuitive at first, this is actually pretty straightforward. When the router
gets a packet destined for `192.0.2.0/24`, it will know that the next hop is some link-local IPv6
address, which it can resolve by _NDP_ on ethernet interface `e1`. It can then simply forward the
IPv4 datagram to the MAC address it found.
Who would've thunk that you do not need ARP or even IPv4 on the interface at all?
### 2. Originating ICMP messages
{{< image width="200px" float="right" src="/assets/vpp-babel/too-big.png" alt="Too Big" >}}
The Internet Control Message Protocol is described in
[[RFC792](https://datatracker.ietf.org/doc/html/rfc792)]. It's mostly used to carry diagnostic and
debugging information, either originated by end hosts, for example the "destination unreachable,
port unreachable" types of messages, but they may also be originated by intermediate routers, for
example with most other kinds of "destination unreachable" packets.
Path MTU Discovery, described in [[RFC1191](https://datatracker.ietf.org/doc/html/rfc1191)] allows a
host to discover the maximum packet size that a route is able to carry. There are a few different
types of _PMTUd_, but the most common one uses ICMPv4 packets coming from these intermediate
routers, informing the sender that a packet which is marked as un-fragmentable cannot be
transmitted because it is too large.
Without the ability for a router to signal these ICMPv4 packets, end to end connectivity quality
might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able to
originate ICMPv4 traffic.
If you're curious, you can read more in this [[IETF
Draft](https://www.ietf.org/archive/id/draft-chroboczek-int-v4-via-v6-01.html)] from Juliusz
Chroboczek et al. It's really insightful, yet elegant.
{{< image width="200px" float="right" src="/assets/vpp-babel/Babel_logo_black.svg" alt="Babel Logo" >}}
## Introducing Babel
I've learned so far that I (a) **MAY** use IPv6 link-local networks in order to _forward_ IPv4
packets, as I can use IPv6 _NDP_ to find the link-layer next hop; and (b) each router **SHOULD** be
able to _originate_ ICMPv4 packets, therefore it needs _at least one_ IPv4 address.
These two claims mean that I need _at most one_ IPv4 address on each router. Could it be?!
**Babel** is a loop-avoiding distance-vector routing protocol that is designed to be robust and
efficient both in networks using prefix-based routing and in networks using flat routing ("mesh
networks"), and both in relatively stable wired networks and in highly dynamic wireless networks.
The definitive [[RFC8966](https://datatracker.ietf.org/doc/html/rfc8966)] describes it in great
detail, and earlier work is in [[RFC7557](https://datatracker.ietf.org/doc/html/rfc7557)] and
[[RFC6126](https://datatracker.ietf.org/doc/html/rfc6126)]. Lots of reading :) Babel is a _hybrid_
routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4
and IPv6), regardless of which protocol the Babel packets are themselves being carried over.
I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops
across address families, which is described in [[RFC9229](https://datatracker.ietf.org/doc/html/rfc9229)]:
> When a packet is routed according to a given routing table entry, the forwarding plane typically
> uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol
> [[RFC4861](https://datatracker.ietf.org/doc/html/rfc4861)] in the case of IPv6 and the Address
> Resolution Protocol (ARP) [[RFC826](https://datatracker.ietf.org/doc/html/rfc826)] in the case of
> IPv4) to map the next-hop address to a link-layer address (a "Media Access Control (MAC)
> address"), which is then used to construct the link-layer frames that encapsulate forwarded
> packets.
>
> It is apparent from the description above that there is no fundamental reason why the destination
> prefix and the next-hop address should be in the same address family: there is nothing preventing
> an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next
> hop's MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed
> through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer
> addresses directly in the next-hop entry of the routing table, which is commonly done in networks
> using the OSI protocol suite).
### Babel and Bird2
There's an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me
extra enthusiastic, is that I found out the functionality described in RFC9229 was committed about a
year ago in Bird2
[[ref](https://gitlab.nic.cz/labs/bird/-/commit/eecc3f02e41bcb91d463c4c1189fd56bc44e6514)], with a
hat-tip to Toke Høiland-Jørgensen.
The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than
this commit, so my first order of business is to get a Debian package:
```
pim@summer:~/src$ sudo apt install devscripts
pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz
pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz
pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i
pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us
```
And that yields me a fresh Bird 2.14 package. I can't help but wonder though, why did the semantic
versioning [[ref](https://semver.org/)] of `2.0.X` change to `2.14`? I found an answer in the NEWS
file of the 2.13 release
[[link](https://gitlab.nic.cz/labs/bird/-/blob/7d2c7d59a363e690995eb958959f0bc12445355c/NEWS#L45-50)].
It's a little bit of a disappointment, but I quickly get over myself because I want to take this
Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!
### Babel and the LAB
I decide to take an IPng [[lab]({% post_url 2022-10-14-lab-1 %})] out for a spin. These labs come
with four VPP routers and two Debian machines connected like so:
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
The configuration snippet for Bird2 is very simple, as most of the defaults are sensible:
```
pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf
protocol babel {
interface "e*" {
type wired;
extended next hop on;
};
ipv6 { import all; export all; };
ipv4 { import all; export all; };
}
EOF
pim@vpp0-0:~$ birdc show babel interfaces
BIRD 2.14 ready.
babel1:
Interface State Auth RX cost Nbrs Timer Next hop (v4) Next hop (v6)
e1 Up No 96 1 0.958 :: fe80::5054:ff:fef0:1101
pim@vpp0-0:~$ birdc show babel neigh
BIRD 2.14 ready.
babel1:
IP address Interface Metric Routes Hellos Expires Auth RTT (ms)
fe80::5054:ff:fef0:1110 e1 96 8 16 5.003 No 4.831
pim@vpp0-0:~$ birdc show babel entries
BIRD 2.14 ready.
babel1:
Prefix Router ID Metric Seqno Routes Sources
192.168.10.0/32 00:00:00:00:c0:a8:0a:00 0 1 0 0
192.168.10.0/24 00:00:00:00:c0:a8:0a:00 0 1 1 0
192.168.10.1/32 00:00:00:00:c0:a8:0a:01 96 7 1 0
2001:678:d78:200::/128 00:00:00:00:c0:a8:0a:00 0 1 0 0
2001:678:d78:200::/60 00:00:00:00:c0:a8:0a:00 0 1 1 0
2001:678:d78:200::1/128 00:00:00:00:c0:a8:0a:01 96 7 1 0
```
Based on this simple configuration, Bird2 will start the babel protocol on `e0` and `e1`, and it
quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol
database (called _entries_), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and
2001:678:d78:200::), the neighbor's IPv4 and IPv6 loopbacks (192.168.10.1 and 2001:678:d78:200::1),
and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).
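As an aside, the router IDs in this output appear to simply embed the router's IPv4 loopback in their low 32 bits; this is an observation about Bird's output here, not something the Babel RFCs mandate. A quick decode:

```python
# Decode the low 32 bits of the Babel router ID from `show babel entries`.
# Observation: 00:00:00:00:c0:a8:0a:00 spells out this router's IPv4 loopback.
rid = "00:00:00:00:c0:a8:0a:00"
v4 = ".".join(str(int(octet, 16)) for octet in rid.split(":")[-4:])
print(v4)  # 192.168.10.0
```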
The coolest part is the `extended next hop on` statement, which enables Babel to set the nexthop
to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:
```
pim@vpp0-0:~$ ip ro
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32
unreachable 192.168.10.0/24 proto bird metric 32
pim@vpp0-0:~$ ip -6 ro
2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium
unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e1 proto kernel metric 256 pref medium
```
**✅ Setting IPv4 routes over IPv6 nexthops works!**
### Babel and VPP
For the [[VPP](https://fd.io)] configuration, I start off with a pretty much _empty_ configuration,
creating only a loopback interface called `loop0`, setting the interfaces up, and exposing them in
_LinuxCP_:
```
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```
Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the
loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and
IPv6 loopbacks across the network.
### Adding support to VPP
IPv6 pings and looks good. However, IPv4 endpoints do not ping yet. The first thing I look at is
whether VPP understands how to interpret an IPv4 route with an IPv6 nexthop. I think it does, because I
remember reviewing a change from Adrian during our MPLS [[project]({% post_url 2023-05-28-vpp-mpls-4
%})], which he submitted in this [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/38633)]. His change
allows VPP to use routes with `rtnl_route_nh_get_via()` to map them to a different address family,
exactly what I am looking for. The routes are correctly installed in the FIB:
```
pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ]
192.168.10.1/32 fib:0 index:31 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ]
path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1
[@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]]
[0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800
```
Using the Open vSwitch tap I can clearly see the packets go out from `vpp0-0.e1` and into
`vpp0-1.e0`, but there is no response, so they are getting lost somewhere in `vpp0-1`. I take a
packet trace on `vpp0-1`, expecting to find the ICMP packet there:
```
pim@vpp0-1:~$ vppctl show trace
07:42:53:178694: dpdk-input
GigabitEthernet10/0/0 rx queue 0
buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 0, nb_segs 1, pkt_len 98
buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178765: ethernet-input
frame: flags 0x1, hw-if-index 1, sw-if-index 1
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
07:42:53:178791: ip4-input
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178810: ip4-not-enabled
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178833: error-drop
rx:GigabitEthernet10/0/0
07:42:53:178835: drop
dpdk-input: no error
```
Okay, that checks out! Going over this packet trace, the `ip4-input` node indeed got handed a packet,
which it promptly rejected by forwarding it to `ip4-not-enabled`, which drops it. It kind of makes
sense: the VPP dataplane doesn't think it's logical to handle IPv4 traffic on an interface which
does not have an IPv4 address. Except -- I'm bending the rules a little bit by doing exactly that.
#### Approach 1: force-enable ip4 in VPP
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
This approach is partially effective:
```
pim@vpp0-0:~$ ip ro get 192.168.10.1
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ping -c5 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms
64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms
^C
--- 192.168.10.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms
pim@vpp0-0:~$ traceroute 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 192.168.10.3 (192.168.10.3) 10.418 ms 10.343 ms 11.362 ms
```
I take a moment to think about why the traceroutes are not responding in the routers in the middle,
and it dawns on me that when the router needs to send an ICMPv4 TTL Exceeded message, it can't
select an IPv4 address to originate the message from, as the interface has none.
**🟠 Forwarding works, but ❌ PMTUd does not!**
#### Approach 2: Use unnumbered interfaces
Looking at my options, I see that VPP is capable of using so-called _unnumbered_ interfaces. These
can be left unconfigured, but borrow an address from another interface. It's a good idea to
borrow from `loop0`, which has a valid IPv4 and IPv6 address. It looks like this in VPP:
```
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# show interface address
GigabitEthernet10/0/0 (dn):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
GigabitEthernet10/0/1 (up):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
loop0 (up):
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
```
The Linux ControlPlane configuration will always synchronize interface information from VPP to
Linux, as I described back then when I [[worked on the plugin]({% post_url 2021-08-13-vpp-2 %})].
Babel starts and sets next hops for IPv4 that look like this:
```
pim@vpp0-2:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64
pim@vpp0-2:~$ ip ro
192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink
unreachable 192.168.10.0/24 proto bird metric 32
192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink
192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink
```
While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors
(192.168.10.1 and 192.168.10.3) are not reachable:
```
pim@vpp0-2:~# ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable
From 192.168.10.2 icmp_seq=2 Destination Host Unreachable
From 192.168.10.2 icmp_seq=3 Destination Host Unreachable
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms
```
I take a look at why that might be, and I notice this on the neighbor `vpp0-1` when I try to ping it
from `vpp0-2`:
```
vpp0-1# show err
Count Node Reason Severity
5 arp-reply IP4 source address not local to sub error
1 arp-reply IP4 source address matches local in error
```
Oh, snap! I traced this down to `src/vnet/arp/arp.c` around line 522, where I can see that when VPP
receives an ARP request, it expects the request to come from a peer in its own subnet. But with a
point to point link like this one, there is nobody else in the `192.168.10.1/32` subnet! I think
this error should not be returned if the interface is `arp_unnumbered()`, defined further up in the
same source file. I write a small patch in Gerrit [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)]
which removes this requirement and the test that asserts the previous behavior, allowing the ARP
request to succeed, and things shoot to life:
```
pim@vpp0-2:~$ ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
```
I make a mental note to discuss this ARP relaxation Gerrit with [[vpp-dev](https://lists.fd.io/g/vpp-dev/message/24155)],
and I'll see where that takes me.
**✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!**
#### Approach 3: VPP Unnumbered Hack
At this point, I think I'm good, but one of the cool features of Babel is that it can use IPv6 next
hops for IPv4 destinations. Setting `GigabitEthernet10/0/X` to unnumbered will make
`192.168.10.X/32` reappear on the `e0` and `e1` interfaces, which will make Babel prefer the more
classic IPv4 next-hops. So can I trick it somehow into using IPv6 anyway?
One option is to ask Babel to use `extended next hop` even when IPv4 is available, which would be a
change to Bird (and possibly a violation of the Babel specification, I should read up on that).
But I think there's another way, so I take a look at the VPP code which prints out the **unnumbered,
use loop0** message, and I find a way to know if an interface is borrowing addresses in this way. I
decide to change the LCP plugin to _inhibit_ sync'ing the addresses if they belong to an interface
which is unnumbered. Because I don't know for sure if everybody would find this behavior desirable,
I make sure to guard the behavior behind a backwards compatible configuration option.
If you're curious, please take a look at the change in my [[GitHub
repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
1. add a CLI call to change the value, `lcp lcp-sync-unnumbered [on|enable|off|disable]`
1. extend the CLI call to show the LCP plugin state, as an additional output of `lcp show`
And with that, the VPP configuration becomes:
```
vpp0-0# lcp lcp-sync on
vpp0-0# lcp lcp-sync-unnumbered off
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```
## Results
I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have
to admit:
```
pim@vpp0-0:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64
e0 UP fe80::5054:ff:fef0:1100/64
e1 UP fe80::5054:ff:fef0:1101/64
e2 DOWN
e3 DOWN
pim@vpp0-0:~$ traceroute -n 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 192.168.10.1 1.882 ms 2.231 ms 1.472 ms
2 192.168.10.2 4.243 ms 3.492 ms 2.797 ms
3 192.168.10.3 6.689 ms 5.925 ms 5.157 ms
pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3
traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
1 2001:678:d78:200::1 2.543 ms 1.762 ms 2.154 ms
2 2001:678:d78:200::2 4.943 ms 3.063 ms 3.562 ms
3 2001:678:d78:200::3 6.273 ms 6.694 ms 7.086 ms
```
**✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!**
I recorded a little [[screencast](/assets/vpp-babel/screencast.cast)] that shows my work, so far:
{{< image src="/assets/vpp-babel/screencast.gif" alt="Babel IPv4-less VPP" >}}
## Additional thoughts
### Comparing OSPFv2 and Babel
Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid the use of
IPv4 transit networks, by making use of this `peer` pattern, which is similar but not quite the same
as what I discussed in _Approach 2_ above:
```
$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0
$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1
```
The Linux ControlPlane plugin is not currently capable of accepting the `peer` netlink message, and
I can see a problem: VPP does not allow for two interfaces to have the same IP address, _unless_ one
is borrowing from another using _unnumbered_. I wonder why that is ...
I could certainly give implementing that `peer` pattern in Netlink a go, but I'm not enthusiastic.
To consume the netlink message correctly, the plugin would need to assert that the left hand (source)
IPv4 address strictly corresponds to a loopback, then internally rewrite the address addition into
an _unnumbered_ use, and also somehow reject (delete?) the netlink configuration otherwise. Ick!
I think there's a more idiomatic way of doing this in VPP. OSPFv2 doesn't really _need_ to use the
`peer` pattern, as long as the point to point peer is reachable. Babel is emitting a static route
over the interface after using IPv6 to learn its peer's IPv4 address, which is really neat! I
suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as
well.
The VPP idiom for the `peer` pattern above, which Babel does naturally, and OSPFv2 could be manually
configured to do, would look like this:
```
vpp0-2# set interface ip address loop0 192.168.10.2/32
vpp0-2# set interface state loop0 up
vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-2# set interface state GigabitEthernet10/0/0 up
vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0
vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-2# set interface state GigabitEthernet10/0/1 up
vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1
```
Either way, using point to point connections (like these explicit static routes, or the implied
static routes that the `peer` pattern will yield) over an ethernet broadcast medium will require the
ARP [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] change to be merged. That one seems reasonably
straightforward, because allowing point to point to work over an ethernet broadcast medium is
successfully done by many popular vendors, and I can't find any RFC that forbids it. Perhaps VPP is
being a bit too strict.
### To Unnumbered or Not To Unnumbered
I'm torn between _Approach 2_ and _Approach 3_. While on the one hand, setting the _unnumbered_
interface would be best reflected in Linux, it is not without problems. If the operator subsequently
tries to remove one of the addresses on `e0` or `e1`, that will yield a desync between Linux and
VPP (Linux will have removed the address, but VPP will still be _unnumbered_). On the other hand,
tricking Linux (and the operator) to believe there isn't an IPv4 (and IPv6) address configured on
the interface, is also not great.
Of the two approaches, I think I prefer _Approach 3_ (changing the Linux CP plugin to not sync
unnumbered addresses), because it minimizes the chance of operator error. If you're reading this and
have an Opinion&trade;, would you please let me know?
## What's Next
I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me
to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable
IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see
if it makes sense to upstream my changes. Personally, I think it's a reasonable direction, because
(a) both changes are backwards compatible and (b) its semantics are pretty straight forward. I'll
also add some configuration knobs to [[vppcfg]({% post_url 2022-04-02-vppcfg-2 %})] to make it
easier to configure VPP in this way.
Of course, migrating AS8298 won't be overnight, I need to gain a bit more confidence, and obviously
upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of peer review.
And finally I need to roll this new IPv4-less IGP out very carefully and without interruptions,
which considering the IGP is the most fundamental building block of the network, may be tricky.
But, I am uncomfortably excited by the prospect of having my network go entirely without backbone
transit networks. By the way: Babel is amazing!
---
date: "2024-04-06T10:17:54Z"
title: VPP with loopback-only OSPFv3 - Part 1
---
{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}}
# Introduction
A few weeks ago I took a good look at the [[Babel]({% post_url 2024-03-06-vpp-babel-1 %})] protocol.
I found a set of features there that I really appreciated. The first was latency aware routing:
useful for mesh (wireless) networks, but also a good fit for IPng's usecase, notably because IPng
makes use of carrier ethernet which, if any link in the underlying MPLS network fails, will
automatically re-route, but sometimes with much higher latency. In these cases, Babel can reconverge
on its own to a topology that has the lowest end to end latency.
But a second really cool find, is that Babel can use IPv6 nexthops for IPv4 destinations - which is
_super_ useful because it will allow me to retire all of the IPv4 /31 point to point networks
between my routers. AS8298 has about half of a /24 tied up in these otherwise pointless (pun
intended) transit networks.
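The back of the envelope arithmetic is simple (the link count below is my assumption; all the
article claims is "about half of a /24"):

```python
def transit_addresses(num_links: int, prefixlen: int = 31) -> int:
    # Every point to point /31 transit network consumes 2 IPv4 addresses.
    return num_links * 2 ** (32 - prefixlen)

# With roughly 64 backbone links, the /31s alone already consume
# half of a /24's 256 addresses:
print(transit_addresses(64))  # 128
```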
In the same week, my buddy Benoit asked a question about OSPFv3 on the Bird users mailinglist
[[ref](https://www.mail-archive.com/bird-users@network.cz/msg07961.html)] which may or may not have
been because I had been messing around with Babel using only IPv4 loopback interfaces. And just a
few weeks before that, the incomparable Nico from [[Ungleich](https://ungleich.ch/)] had a very
similar question [[ref](https://bird.network.cz/pipermail/bird-users/2024-January/017370.html)].
These three folks have something in common - we're all trying to conserve IPv4 addresses!
# OSPFv3 with IPv4 🙁
Nico's thread referenced [[RFC 5838](https://datatracker.ietf.org/doc/html/rfc5838)] which defines
support for multiple address families in OSPFv3. It does this by mapping a given address family to
a specific instance of OSPFv3 using the _instance id_ and adding a new option to the _options field_
that tells neighbors that multiple address families are supported in this instance (and thus, that
the neighbor should not assume all link state advertisements are IPv6-only).
This way, multiple instances can run on the same router, and they will only form adjacencies with
neighbors that are operating in the same address family. This in itself doesn't change much: rather
than using IPv4 multicast in the hello's while forming adjacencies, OSPFv3 will use IPv6 link local
addresses for them.
RFC 5838, Section 2.5 says:
> Although IPv6 link local addresses could be used as next hops for IPv4 address families, it is
> desirable to have IPv4 next-hop addresses. [ ... ] In order to achieve this, the link's IPv4
> address will be advertised in the "link local address" field of the IPv4 instance's Link-LSA.
> This address is placed in the first 32 bits of the "link local address" field and is used for IPv4
> next-hop calculations. The remaining bits MUST be set to zero.
First my hopes are raised by the statement that IPv6 link local addresses _could_ be used as
next hops (just like Babel, yaay!), but then it goes on to say the link local address field will be
overwritten with an IPv4 address in the top 32 bits. That's ... gross. I understand why this was
done, as it allows for a minimal deviation from the OSPFv3 protocol, but this unfortunate choice
precludes the ability for IPv6 nexthops to be used. Crap on a cracker!
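To make the hack concrete, here's a tiny sketch of the RFC 5838 §2.5 encoding quoted above: the
IPv4 address lands in the first 32 bits of the 128-bit "link local address" field, and the
remaining 96 bits are zeroed:

```python
import ipaddress

def ipv4_in_linklsa_field(v4: str) -> bytes:
    # RFC 5838 §2.5: place the IPv4 address in the first 32 bits of the
    # 128-bit "link local address" field; the remaining bits MUST be zero.
    return ipaddress.IPv4Address(v4).packed + bytes(12)

field = ipv4_in_linklsa_field("192.168.10.1")
print(field.hex())  # "c0a80a01" followed by 24 zero nibbles
```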
# OSPFv3 with IPv4 🥰
But wait, not all is lost! Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})]
article I mentioned that VPP has this ability to run _unnumbered_ interfaces? To recap, this is a
configuration where a primary interface, typically a loopback, will have an IPv4 and IPv6 address,
say **192.168.10.2/32** and **2001:678:d78:200::2/128** and other interfaces will borrow from that.
That will allow for the IPv4 address to be present on multiple interfaces, like so:
```
pim@vpp0-2:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64
```
### VPP Changes
Historically in VPP, a broadcast medium like ethernet will respond to ARP requests only if the
requestor is in the same subnet. With these point to point interfaces, the remote will _never_ be in
the same subnet, because we're using /32 addresses here! VPP logs these as invalid ARP requests.
With a small change though, I can make VPP tolerant of this scenario, and the consensus
in the VPP community is that this is OK.
{{< image src="/assets/vpp-ospf/vpp-diff1.png" alt="VPP Diff #1" >}}
Check out [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)] for the full change, but in a nutshell,
just before deciding to return an error because the requesting source address is not directly
connected (called an `attached` route in VPP), I'll change the condition to allow for it, if and
only if the ARP request comes from an _unnumbered_ interface.
I think this is a good direction, if only because most other popular implementations (including
Linux, FreeBSD, Cisco IOS/XR and Juniper) will answer ARP requests that are onlink but not directly
connected, in the same way.
### Bird2 Changes
Meanwhile, in the Bird community, we were thinking about solving this problem in a different way.
Babel allows a feature to use IPv6 transit networks with IPv4 destinations, by specifying an option
called `extended next hop`. With this option, Babel will set a nexthop across address families. It
may sound freaky at first, but it's not too strange when you think about it. Take a look at my
explanation in the [[Babel]({% post_url 2024-03-06-vpp-babel-1 %})] article on how IPv6 neighbor
discovery can take the place of IPv4 ARP resolution to figure out the ethernet next hop.
So our initial take was: why don't we do that with OSPFv3 as well? We thought of a trick to
get that Link LSA hack from RFC5838 removed: what if Bird, upon setting the `extended next hop`
feature on an interface, would simply put the IPv6 address back like it was, rather than overwriting
it with the IPv4 address? That way, we'd just learn routes to IPv4 destinations with nexthops on
IPv6 linklocal addresses. It would break compatibility with other vendors, but seeing as it is an
optional feature which defaults to off, perhaps it is a reasonable compromise...
Ondrej started to work on it, but came back a few days later with a different solution, which is
quite clever. Any IPv4 router needs at least one IPv4 address anyways, to be able to send ICMP
messages, so there is no need to put IPv4 addresses on links. Ondrej's theory corroborates my
previous comments for Babel's IPv4-less routing:
> I've learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets,
> as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to
> originate ICMPv4 packets, therefore it needs at least one IPv4 address.
>
> These two claims mean that I need at most one IPv4 address on each router.
Ondrej's proposal for Bird2 will, when OSPFv3 is used with IPv4 destinations, keep the RFC5838
behavior and try to _find_ a working IPv4 address to put in the Link LSA:
{{< image src="/assets/vpp-ospf/bird-diff1.png" alt="Bird Diff #1" >}}
He adds a function `update_loopback_addr()`, which scans all interfaces for an IPv4 address, and if
there are multiple, prefer host addresses, then addresses from OSPF stub interfaces, and finally
just any old IPv4 address. Now that IPv4 address can be simply used to put in the Link LSA. Slick!
His change also removes the next-hop-in-address-range check for OSPFv3 when using IPv4, and
automatically adds the onlink flag to such routes, so that next hops that are not directly
connected are newly accepted:
{{< image src="/assets/vpp-ospf/bird-diff2.png" alt="Bird Diff #2" >}}
I realize when reading the code that this change paired with the [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]
are perfect partners:
1. Ondrej's change will make the Link LSA be _onlink_, which is a way to describe that the next hop
is not directly connected, in other words nexthop `192.168.10.3/32`, while the router itself is
`192.168.10.2/32`.
1. My change will make VPP answer for ARP requests in such a scenario where the router with an
_unnumbered_ interface with `192.168.10.3/32` will respond to a request from the not directly
connected _onlink_ peer at `192.168.10.2`.
## Tying it together
With all of that, I am ready to demonstrate two working solutions now. I first compile Bird2 with
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.
I take an IPng lab instance out for a spin with this updated Bird2 and VPP+lcpng environment:
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
### Solution 1: Somewhat unnumbered
I configure an otherwise empty VPP dataplane as follows:
```
vpp0-3# lcp lcp-sync on
vpp0-3# lcp lcp-sync-unnumbered on
vpp0-3# create loopback interface instance 0
vpp0-3# set interface state loop0 up
vpp0-3# set interface ip address loop0 192.168.10.3/32
vpp0-3# set interface ip address loop0 2001:678:d78:200::3/128
vpp0-3# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-3# set interface mtu packet 9000 GigabitEthernet10/0/0
vpp0-3# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-3# set interface state GigabitEthernet10/0/0 up
vpp0-3# lcp create loop0 host-if loop0
vpp0-3# lcp create GigabitEthernet10/0/0 host-if e0
```
Which yields the following configuration:
```
pim@vpp0-3:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0 UP 192.168.10.3/32 2001:678:d78:200::3/128 fe80::5054:ff:fef0:1130/64
pim@vpp0-3:~$ ip route get 182.168.10.2
RTNETLINK answers: Network is unreachable
```
I can see that VPP copied forward the IPv4/IPv6 addresses to interface `e0`, and because there's no
routing protocol running yet, the neighbor router `vpp0-2` is unreachable. Let me fix that, next. I
start bird in the VPP dataplane network namespace, and configure it as follows:
```
router id 192.168.10.3;
protocol device { scan time 30; }
protocol direct { ipv4; ipv6; check link yes; }
protocol kernel kernel4 {
ipv4 { import none; export where source != RTS_DEVICE; };
learn off; scan time 300;
}
protocol kernel kernel6 {
ipv6 { import none; export where source != RTS_DEVICE; };
learn off; scan time 300;
}
protocol bfd bfd1 {
interface "e*" {
interval 100 ms;
multiplier 20;
};
}
protocol ospf v3 ospf4 {
ipv4 { export all; import where (net ~ [ 192.168.10.0/24+, 0.0.0.0/0 ]); };
area 0 {
interface "loop0" { stub yes; };
interface "e0" { type pointopoint; cost 5; bfd on; };
};
}
protocol ospf v3 ospf6 {
ipv6 { export all; import where (net ~ [ 2001:678:d78:200::/56, ::/0 ]); };
area 0 {
interface "loop0" { stub yes; };
interface "e0" { type pointopoint; cost 5; bfd on; };
};
}
```
This minimal Bird2 configuration will configure the main protocols `device`, `direct`,
and two kernel protocols `kernel4` and `kernel6`, which are instructed to export all routes to the
kernel except directly connected ones (the Linux kernel and VPP already have those when an interface
is brought up, so exporting them would create duplicate connected route entries).
If you haven't come across it yet, _Bidirectional Forwarding Detection_ or _BFD_ is a protocol that
repeatedly sends UDP packets between routers, to be able to detect if the forwarding is interrupted
even if the interface link stays up. It's described in detail in
[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the
place.
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called
`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol
is OSPFv3, here _both_ are using OSPFv3! I'll instruct Bird to erect a _BFD_ session for any
neighbor it establishes an adjacency with. If at any point the BFD session times out (currently at
20x100ms or 2.0s), OSPF will tear down the adjacency.
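The 2.0s figure falls straight out of RFC 5880's detection time rule, sketched here with the
negotiation between peers left out for simplicity:

```python
def bfd_detection_time_ms(interval_ms: int, multiplier: int) -> int:
    # RFC 5880: the detection time is the transmit interval multiplied
    # by the detect multiplier (ignoring negotiation, for this sketch).
    return interval_ms * multiplier

print(bfd_detection_time_ms(100, 20))  # 2000 ms, i.e. 2.0s
```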
The OSPFv3 protocols each define one channel, in which I allow Bird to export anything, but import
only those routes that are in the LAB IPv4 (192.168.10.0/24) and IPv6 (2001:678:d78:200::/56), and
I'll also allow a default to be learned over OSPF for both address families. That'll come in handy
later.
I start up Bird on the rightmost two routers in the lab (`vpp0-3` and `vpp0-2`). Looking at
`vpp0-3`, Bird starts sending IPv6 hello packets on interface `e0`, and pretty quickly finds not one
but two neighbors:
```
pim@vpp0-3:~$ birdc show ospf neighbors
BIRD v2.15.1-4-g280daed5-x ready.
ospf4:
Router ID Pri State DTime Interface Router IP
192.168.10.2 1 Full/PtP 30.870 e0 fe80::5054:ff:fef0:1121
ospf6:
Router ID Pri State DTime Interface Router IP
192.168.10.2 1 Full/PtP 30.870 e0 fe80::5054:ff:fef0:1121
```
Bird is able to sort out which is which on account of the 'normal' IPv6 OSPFv3 having an _instance
id_ value of 0 (IPv6 Unicast), and the IPv4 OSPFv3 having an _instance id_ of 64 (IPv4 Unicast).
Further, the IPv4 variant will set the AF-bit in the OSPFv3 options, so the peer will know it
supports using the Link LSA to model IPv4 nexthops rather than IPv6 nexthops.
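RFC 5838 carves the OSPFv3 _instance id_ space into per-address-family ranges; 0 and 64, as used by
the two protocols above, are the bases of the IPv6 and IPv4 unicast ranges. A tiny helper (a sketch,
with the ranges taken from the RFC) shows how a receiver classifies an instance:

```python
def ospfv3_af(instance_id: int) -> str:
    # RFC 5838 instance ID ranges; 0 and 64 are the bases of the
    # IPv6 and IPv4 unicast ranges used by ospf6 and ospf4 above.
    if 0 <= instance_id <= 31:
        return "IPv6 unicast"
    if 32 <= instance_id <= 63:
        return "IPv6 multicast"
    if 64 <= instance_id <= 95:
        return "IPv4 unicast"
    if 96 <= instance_id <= 127:
        return "IPv4 multicast"
    return "unassigned"

print(ospfv3_af(0), "/", ospfv3_af(64))  # IPv6 unicast / IPv4 unicast
```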
Indeed, routes are quickly learned:
```
pim@vpp0-3:~$ birdc show route table master4
BIRD v2.15.1-4-g280daed5-x ready.
Table master4:
192.168.10.3/32 unicast [direct1 13:02:56.883] * (240)
dev loop0
unicast [direct1 13:02:56.883] (240)
dev e0
unicast [ospf4 13:02:56.980] I (150/0) [192.168.10.3]
dev loop0
dev e0
192.168.10.2/32 unicast [ospf4 13:03:04.980] * I (150/5) [192.168.10.2]
via 192.168.10.2 on e0 onlink
```
They are quickly propagated both to the Linux kernel, and by means of Netlink into the Linux Control
Plane plugin in VPP, which programs it into VPP's _FIB_:
```
pim@vpp0-3:~$ ip ro
192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink
pim@vpp0-3:~$ vppctl show ip fib 192.168.10.2
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
192.168.10.2/32 fib:0 index:23 locks:3
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[40] locks:2 flags:shared, uPRF-list:22 len:1 itfs:[1, ]
path:[53] pl-index:40 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
192.168.10.2 GigabitEthernet10/0/0
[@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800
adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1
path-list:[43] locks:1 uPRF-list:24 len:1 itfs:[1, ]
path:[56] pl-index:43 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
192.168.10.2 GigabitEthernet10/0/0
[@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800
Extensions:
path:56
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:28 buckets:1 uRPF:22 to:[0:0]]
[0] [@5]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800
```
The neighbor is reachable, over IPv6 (which is nothing special), but also over IPv4:
```
pim@vpp0-3:~$ ping -c5 2001:678:d78:200::2
PING 2001:678:d78:200::2(2001:678:d78:200::2) 56 data bytes
64 bytes from 2001:678:d78:200::2: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=2 ttl=64 time=3.69 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=3 ttl=64 time=2.66 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=4 ttl=64 time=2.30 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=5 ttl=64 time=2.92 ms
--- 2001:678:d78:200::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.164/2.747/3.687/0.540 ms
pim@vpp0-3:~$ ping -c5 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=3.58 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=3.40 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=3.28 ms
64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=3.32 ms
64 bytes from 192.168.10.2: icmp_seq=5 ttl=64 time=3.29 ms
--- 192.168.10.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 3.283/3.374/3.577/0.109 ms
```
**✅ OSPFv3 with IPv4/IPv6 on-link nexthops works!**
### Solution 2: Truly unnumbered
However, Ondrej's patch does something in addition to this. I repeat the same setup, except now I
set one additional feature when starting up VPP: `lcp lcp-sync-unnumbered off`
What happens next is that VPP's dataplane looks subtly different. It has created an _unnumbered_
interface keyed off of `loop0`, but it doesn't propagate the addresses to Linux.
```
pim@vpp0-3:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0 UP fe80::5054:ff:fef0:1130/64
```
With `e0` only having a linklocal address, Bird can still form an adjacency with its neighbor
`vpp0-2`, because adjacencies in OSPFv3 are formed using IPv6 only. However, the clever trick to
walk the list of interfaces in `update_loopback_addr()` will be able to find a usable IPv4 address,
and use that to put in the Link LSA using RFC5838. In this case, it finds `192.168.10.3` from
interface `loop0` so it'll use that to signal the next hop for LSAs that it sends.
Now I start the same VPP and Bird configuration on all four VPP routers, but on `vpp0-0` I'll add a
static route out of the LAB to the internet:
```
protocol static static4 {
ipv4 { export all; };
route 0.0.0.0/0 via 192.168.10.4;
}
protocol static static6 {
ipv6 { export all; };
route ::/0 via 2001:678:d78:201::ffff;
}
```
These two default routes from `vpp0-0` quickly propagate through the network, where `vpp0-3`
ultimately sees this:
```
pim@vpp0-3:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0 UP fe80::5054:ff:fef0:1130/64
pim@vpp0-3:~$ ip ro
default via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.0 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.1 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.4/31 via 192.168.10.2 dev e0 proto bird metric 32 onlink
pim@vpp0-3:~$ ip -6 ro
2001:678:d78:200:: via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::2 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::3 dev loop0 proto kernel metric 256 pref medium
2001:678:d78:201::/112 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e0 proto kernel metric 256 pref medium
default via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
```
**✅ OSPFv3 with loopback-only, _unnumbered_ IPv4/IPv6 interfaces works!**
## Results
I thought I'd record a little [[asciinema](/assets/vpp-ospf/clean.cast)] that shows the end to end
configuration, starting from an empty dataplane and bird configuration. I'll show _Solution 2_, that
is, the solution that doesn't copy the _unnumbered_ interfaces in VPP to Linux.
Ready? Here I go!
{{< image src="/assets/vpp-ospf/clean.gif" alt="OSPF IPv4-less VPP" >}}
### To unnumbered or Not To unnumbered
I'm torn between _Solution 1_ and _Solution 2_. While on the one hand, setting the _unnumbered_
interface would be best reflected in Linux, it is not without problems. If the operator subsequently
tries to remove one of the addresses on `e0` or `e1`, that will yield a desync between Linux and
VPP (Linux will have removed the address, but VPP will still be _unnumbered_). On the other hand,
tricking Linux (and the operator) into believing there isn't an IPv4 (and IPv6) address configured on
the interface is also not great.
Of the two approaches, I think I prefer _Solution 2_ (configuring the Linux CP plugin to not sync
_unnumbered_ addresses), because it minimizes the chance of operator error. If you're reading this and
have an Opinion&trade;, would you please let me know?

---
date: "2024-04-27T10:52:11Z"
title: FreeIX - Remote
---
# Introduction
{{< image width="300px" float="right" src="/assets/freeix/openart-image_REzWzO43_1714219288118_raw.jpg" alt="OpenART" >}}
Tier1 and aspiring Tier2 providers interconnect only in large metropolitan areas, due to commercial incentives and
politics. They won't often peer with smaller providers, because why peer with a potential customer? Due to this,
it's entirely likely that traffic between two parties in Thessaloniki is sent to Frankfurt or Milan and back.
One possible antidote to this is to connect to a local Internet Exchange point. Not all ISPs have access to large
metropolitan datacenters where larger internet exchanges have a point of presence, and it doesn't help that the
datacenter operator is happy to charge a substantial amount of money each month, just for the privilege of having
a passive fiber cross connect to the exchange. Many Internet Exchanges these days ask for per-month port costs *and*
meter the traffic with policers and rate limiters, such that the total cost of peering starts to exceed what one
might pay for transit, especially at low volumes, which further exacerbates the problem. Bah.
This is an unfortunate market effect (the race to the bottom), where transit providers continuously lower their
prices to compete. And while transit providers can compensate to some extent through economies of scale, at some point
they are mostly all of equal size, and thus the only thing left to flex on is quality of service.
The benefit of using an Internet Exchange is to reduce the portion of an ISP's (and CDN's) traffic that must be
delivered via their upstream transit providers, thereby reducing the average per-bit delivery cost as well as the
end-to-end latency as seen by their users or customers. Furthermore, the increased number of paths available through
the IXP improves routing efficiency and fault-tolerance, and it avoids traffic going the scenic route to a large hub
like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.
IPng Networks really believes in an open and affordable Internet, and I would like to do my part in ensuring the
internet stays accessible for smaller parties.
## Smöl IXPs
One notable problem with small exchanges, like for example [[FNC-IX](https://www.fnc-ix.net/)] in the Paris metro, or
[[CHIX-CH](https://ch-ix.ch/)], [[Community IX](https://www.community-ix.ch/)] and [[Free-IX](https://free-ix.ch/)] in
the Zurich metropolitan area, is that they are, well, small. They may be cheaper to connect to, in some cases even free,
but they don't have a sizable membership which means that there is inherently less traffic flowing, which in turn makes
it less appealing for prospect members to connect to.
At IPng, I have partnered with a few super cool ISPs and carriers to offer a Free Internet Exchange platform. Just to
head the main question off at the pass: _Free_ here actually does mean "Free as in beer" or
[[Gratis](https://en.wikipedia.org/wiki/Gratis)], a gift to the community that does not cost money. It also more
philosophically wants to be "Free as in open, and transparent" or
[[Libre](https://en.wikipedia.org/wiki/Free_software)].
Two examples are:
* [[Free IX: Switzerland](https://free-ix.ch/)] with POPs at STACK GEN01 Geneva, NTT Zurich and Bancadati Lugano.
* [[Free IX: Greece](https://free-ix.gr/)] with POPs at TISparkle in Athens and Balkan Gate in Thessaloniki.
.. but there are actually quite a few out there once you start looking :)
## Growing Smöl IXPs
Some internet exchanges break through the magical 1Tbps barrier (and get a courtesy callout on Twitter from Dr. King),
but many remain smöl. Perhaps it's time to break the _chicken-and-egg_ problem. What if there was a way to interconnect
these exchanges?
Let's take for example the Free IX in Greece that was announced at GRNOG16 in Athens on April 19th. This exchange
initially targets Athens and Thessaloniki, with 2x100G between the two cities. Members can connect to either site for
the cost of only a cross connect. The 1G/10G/25G ports will be _Gratis_. But I will be connecting one very special
member to Free IX Greece, AS50869:
{{< image src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}}
## Free IX: Remote
Here's what I am going to build. The _Free IX Remote_ project offers an outreach infrastructure which connects to
internet exchange points, and allows members to benefit from that in the following way:
1. FreeIX uses AS50869 to peer with any network operator who is available at public internet exchanges or using
private interconnects. It looks like a normal service provider in this regard. It will connect to internet
exchanges, and learn a bunch of routes.
1. FreeIX _members_ can join the program, after which they are granted certain propagation permissions by FreeIX
at the point where they have a BGP session with AS50869. The prefixes learned on these _member_ sessions are marked
as such, and will be allowed to propagate. Members will receive some or all learned prefixes from AS50869.
1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are propagated and at
which locations.
Members at smaller internet exchanges greatly benefit from this type of outreach, by receiving large portions of the
public internet directly at their preferred peering location. Similarly, the _Free IX Remote_ routers will carry
their traffic to these remote internet exchanges.
## Detailed Design
### Peer types
There are two types of BGP neighbor adjacency:
1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added
to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational**
communities, and members can drive certain behavior with BGP **action** communities.
1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private
network interconnects. Peers receive some (or all) _member prefixes_ from FreeIX and cannot drive any behavior
with communities. With respect to internet exchanges and peers, AS50869 looks like a completely normal ISP,
advertising subsets of the customer AS cone from AS50869:AS-MEMBERS at each exchange point.
BGP sessions with members use strict ingress filtering by means of `bgpq4`, and will be tagged with a set of
informational BGP communities, such as where the prefix was learned, and what propagation permissions it received
(eg. at which internet exchanges it will be allowed to be announced). Of course, prefixes that are RPKI invalid will be
dropped, while valid and unknown prefixes will be accepted. Members are granted _permissions_ by FreeIX, which determine
where their prefixes will be announced by AS50869. Further, members can perform optional actions by means of BGP communities
at their ingress point, to inhibit announcements to a certain peer or at a given exchange point.
Peers on the other hand are not granted any permissions and all action BGP communities will be stripped on prefixes
learned. Informational communities will still be tagged on learned prefixes. Two things happen here. Firstly, members
will be offered only those prefixes for which they have permission -- in other words, I will create a configuration file
that says member AS8298 may receive prefixes learned from Frys-IX. Secondly, even for those prefixes that are advertised,
the member AS8298 can use the informational communities to further filter what they accept from Free IX Remote AS50869.
### BGP Classic Communities
Members are allowed to set the following legacy action BGP communities for coarse grained distribution of their prefixes
through the FreeIX network.
* `(50869,0)` or `(50869,3000)` do not announce anywhere
* `(50869,666)` or `(65535,666)` blackhole everywhere (can be on any more specific from the member's AS-SET)
* `(50869,3100)` prepend once everywhere
* `(50869,3200)` prepend twice everywhere
* `(50869,3300)` prepend three times everywhere
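As a sketch of how an ingress policy might interpret these classic communities (illustrative Python pseudologic, not the actual Bird filter; the names are mine):

```python
# Classic (asn, value) communities per the table above
NO_ANNOUNCE = {(50869, 0), (50869, 3000)}
BLACKHOLE = {(50869, 666), (65535, 666)}
PREPENDS = {3100: 1, 3200: 2, 3300: 3}

def classify(communities):
    """Map a member prefix's classic communities to an (action, prepends) tuple."""
    if communities & BLACKHOLE:
        return ("blackhole", 0)
    if communities & NO_ANNOUNCE:
        return ("no-announce", 0)
    prepends = max((PREPENDS[v] for a, v in communities
                    if a == 50869 and v in PREPENDS), default=0)
    return ("announce", prepends)
```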
Peers, on the other hand, are not allowed to set _any_ communities, so all classic BGP communities from them are stripped
on ingress.
### BGP Large Communities
Free IX Remote will use three types of BGP Large Communities, which each serve a distinct purpose:
1. ***Informational***: These communities are set by the FreeIX router when learning a prefix. They cannot be set by
peers or members, and will be stripped on ingress. They will be sent to both members and peers, allowing operators to
choose which prefixes to learn based on their origin details, like which country or internet exchange they were
learned at.
1. ***Permission***: These communities are also set by FreeIX operators when learning a prefix (eg. on the ingress
router). They cannot be set by peers or members, and will be stripped on ingress. The permission communities
determine where FreeIX will allow the prefix to propagate. They will be stripped on egress.
1. ***Action***: Based on the permissions, members can further steer announcements by sending certain action communities
to FreeIX. These actions cannot be sent by peers, but in certain cases they can be set by FreeIX operators on ingress.
Similarly to the _permission_ communities, all _action_ communities will be stripped on egress.
Regular peers of AS50869 at exchange points and private network interconnects will not be able to set any communities,
so all large BGP communities from them are stripped on ingress.
### Informational Communities
When FreeIX routers learn prefixes, they will annotate them with certain communities. For example, the router at
Amsterdam NIKHEF (which is router #1, country #2), when learning a prefix at FrysIX (which is ixp #1152), will set the
following BGP large communities:
* `(50869,1010,1)`: Informational (10XX), Router (1010), vpp0.nlams0.free-ix.net (1)
* `(50869,1020,2)`: Informational (10XX), Country (1020), Netherlands (2)
* `(50869,1030,1152)`: Informational (10XX), IXP (1030), PeeringDB IXP for FrysIX (1152)
When propagating these prefixes to neighbors (both members and peers), these informational communities can be used to
determine local policy, for example by setting a different localpref or dropping prefixes from a certain location.
Informational communities can be read, but they can't be _set_ by peers or members -- they are always cleared by FreeIX
routers when learning prefixes, and as such the only routers which will set them are the FreeIX ones.
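In other words, the ingress router stamps each learned prefix with its own identity. A tiny sketch of that tagging step, using the IDs from the FrysIX example above (the helper function is hypothetical):

```python
ASN = 50869  # Free IX Remote

def informational_communities(router_id, country_id, ixp_id):
    """Large communities an ingress router would stamp on a learned prefix."""
    return {
        (ASN, 1010, router_id),   # which router learned it
        (ASN, 1020, country_id),  # in which country
        (ASN, 1030, ixp_id),      # at which IXP (PeeringDB ID)
    }
```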
### Permission Communities
FreeIX maintains a list of permissions per member. When members announce their prefixes to FreeIX routers, these
permissions communities are set. They determine what the member is allowed to do with FreeIX propagation - notably which
routers, countries, and internet exchanges the member will be allowed to propagate to.
Usually, member prefixes are allowed to propagate everywhere, so the following communities might be set by the FreeIX
router on ingress:
* `(50869,2010,0)`: Permission (20XX), Router (2010), everywhere (0)
* `(50869,2020,0)`: Permission (20XX), Country (2020), everywhere (0)
* `(50869,2030,0)`: Permission (20XX), IXP (2030), everywhere (0)
If the member prefixes are allowed to propagate only to certain places, the 'everywhere' communities will not be set,
and instead lists of communities with finer grained permissions can be used, for example:
* `(50869,2010,2)`: Permission (20XX), Router (2010), vpp0.grskg0.free-ix.net (2)
* `(50869,2020,3)`: Permission (20XX), Country (2020), Greece (3)
* `(50869,2030,60)`: Permission (20XX), IXP (2030), PeeringDB IXP for SwissIX (60)
Permission communities can't be set by peers, nor by members -- they are always cleared by FreeIX routers when learning
prefixes, and are configured explicitly by FreeIX operators.
### Action Communities
Based on the permission communities, zero or more egress routers, countries and internet exchanges are eligible to
propagate member prefixes by AS50869 to its peers. Members can define very fine grained action communities to further
tweak which prefixes propagate on which routers, in which countries and towards which internet exchanges and private
network interconnects:
* `(50869,3010,3)`: Inhibit Action (30XX), Router (3010), vpp0.gratt0.free-ix.net (3)
* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1)
* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308)
Further actions can be placed on a per-remote-neighbor basis:
* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030)
* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939)
* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859)
* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283)
Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action
communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when
learning prefixes.
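Putting permissions and actions together, the egress decision for a given (prefix, IXP) pair boils down to "permitted and not inhibited". A hedged Python sketch, with community numbers following the scheme above (the function itself is illustrative, not production policy code):

```python
ASN = 50869
PERM_IXP, ACT_INHIBIT_IXP = 2030, 3030
EVERYWHERE = 0

def may_announce_at_ixp(communities, ixp_id):
    """Decide whether a member prefix may be announced at a given IXP.

    communities: set of (asn, function, argument) large-community tuples.
    """
    permitted = ((ASN, PERM_IXP, EVERYWHERE) in communities
                 or (ASN, PERM_IXP, ixp_id) in communities)
    inhibited = (ASN, ACT_INHIBIT_IXP, ixp_id) in communities
    return permitted and not inhibited
```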
## What's next
{{< image width="200px" float="right" src="/assets/freeix/bird-logo.svg" alt="Bird" >}}
Perhaps this interaction between _informational_, _permission_ and _action_ BGP communities gives you an idea on how
such a network may operate. It's somewhat different to a classic Transit provider, in that AS50869 will not carry a
full table. It'll _merely_ provide a form of partial transit from member A at IXP #1, to and from all peers that
can be found at IXPs #2-#N. Makes the mind boggle? Don't worry, we'll figure it out together :)
In an upcoming article I'll detail the programming work that goes into implementing this complex peering policy in Bird2
driving VPP routers (duh), with an IGP that is IPv4-less, because at this point, I [[may as well]({% post_url
2024-04-06-vpp-ospf %})] put my money where my mouth is.
If you're interested in this kind of stuff, take a look at the IPng Networks AS8298 [[Routing Policy]({% post_url
2021-11-14-routing-policy %})]. Similar to that one, this one will use a combination of functional programming, templates,
and clever expansions to make a customized per-member and per-peer configuration based on a YAML input file which
dictates which member and which prefix is allowed to go where.
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
First, I need to get a replacement router for the Thessaloniki router, which will run VPP of course. My buddy Antonis
noticed that there are CPU and/or DDR errors on that chassis, so it may need to be RMAd. But once it's operational, I will
start by deploying one instance in Amsterdam NIKHEF, and another in Thessaloniki Balkan Gate, with a 100G connection
between them, graciously provided by [[LANCOM](https://www.lancom.gr/en/)]. Just look at that FD.io hound runnnnn!!1

---
date: "2024-05-25T12:23:54Z"
title: 'Case Study: NAT64'
---
# Introduction
{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}
IPng's network is built up in two main layers, (1) an MPLS transport layer, which is disconnected
from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP Free core
transport network, which uses MPLS switches from a company called Centec. These switches offer IPv4,
IPv6, VxLAN, GENEVE and GRE all in silicon, are very cheap on power and relatively affordable per
port.
Centec switches allow for a modest but not huge amount of routes in the hardware forwarding tables.
I loadtested them in [[a previous article]({% post_url 2022-12-05-oem-switch-1 %})] at line rate
(well, at least 8x10G at 64b packets and around 110Mpps), and they forward IPv4, IPv6 and MPLS
traffic effortlessly, at 45 watts.
I wrote more about the Centec switches in [[my review]({% post_url 2023-03-11-mpls-core %})] of them
back in 2022.
### IPng Site Local
{{< image width="400px" float="right" src="/assets/nat64/MPLS IPng Site Local v2.svg" alt="IPng SL" >}}
I leverage this internal transport network for more than _just_ MPLS. The transport switches are
perfectly capable of line rate (at 100G+) IPv4 and IPv6 forwarding as well. When designing IPng Site
Local, I created a number plan that assigns ***IPv4*** from the **198.19.0.0/16** prefix, and ***IPv6***
from the **2001:678:d78:500::/56** prefix. Within these, I allocate blocks for _Loopback_ addresses,
_PointToPoint_ subnets, and hypervisor networks for VMs and internal traffic.
Take a look at the diagram to the right. Each site has one or more Centec switches (in red), and
there are three redundant gateways that connect the IPng Site Local network to the Internet (in
orange). I run lots of services in this red portion of the network: site to site backups
[[Borgbackup](https://www.borgbackup.org/)], ZFS replication [[ZRepl](https://zrepl.github.io/)], a
message bus using [[Nats](https://nats.io)], and of course monitoring with SNMP and Prometheus all
make use of this network. But it's not only internal services like management traffic, I also
actively use this private network to expose _public_ services!
For example, I operate a bunch of [[NGINX Frontends]({% post_url 2023-03-17-ipng-frontends %})] that
have a public IPv4/IPv6 address, and reversed proxy for webservices (like
[[ublog.tech](https://ublog.tech)] or [[Rallly](https://rallly.ipng.ch/)]) which run on VMs and
Docker hosts which don't have public IP addresses. Another example which I wrote about [[last
week]({% post_url 2024-05-17-smtp %})], is a bunch of mail services that run on VMs without public
access, but are each carefully exposed via reversed proxies (like Postfix, Dovecot, or
[[Roundcube](https://webmail.ipng.ch)]). It's an incredibly versatile network design!
### Border Gateways
Seeing as IPng Site Local uses native IPv6, it's rather straightforward to give each hypervisor and
VM an IPv6 address, and configure IPv4 only on the externally facing NGINX Frontends. As a reversed
proxy, NGINX will create a new TCP session to the internal server, and that's a fine solution.
However, I also want my internal hypervisors and servers to have full Internet connectivity. For
IPv6, this feels pretty straightforward, as I can just route the **2001:678:d78:500::/56** through
a firewall that blocks incoming traffic, and call it a day. For IPv4, similarly I can use classic
NAT just like one would in a residential network.
**But what if I wanted to go IPv6-only?** This poses a small challenge, because while IPng is fully
IPv6 capable, and has been since the early 2000s, the rest of the internet is not quite there yet.
For example, the quite popular [[GitHub](https://github.com/pimvanpelt/)] hosting site still has
only an IPv4 address. Come on, folks, what's taking you so long?! It is for this purpose that NAT64
was invented. Described in [[RFC6146](https://datatracker.ietf.org/doc/html/rfc6146)]:
> Stateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast
> UDP, TCP, or ICMP. One or more public IPv4 addresses assigned to a NAT64 translator are shared
> among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no
> changes are usually required in the IPv6 client or the IPv4 server.
The rest of this article describes version 2 of the IPng SL border gateways, which opens the path
for IPng to go IPv6-only. By the way, I thought it would be super complicated, but in hindsight: I
should have done this years ago!
#### Gateway Design
{{< image width="400px" float="right" src="/assets/nat64/IPng NAT64.svg" alt="IPng Border Gateway" >}}
Let me take a closer look at the orange boxes that I drew in the network diagram above. I call these
machines _Border Gateways_. Their job is to sit between IPng Site Local and the Internet. They'll
each have one network interface connected to the Centec switch, and another connected to
the VPP routers at AS8298. They will provide two main functions: firewalling, so that no unwanted
traffic enters IPng Site local, and NAT translation, so that:
1. IPv4 users from **198.19.0.0/16** can reach external IPv4 addresses,
1. IPv6 users from **2001:678:d78:500::/56** can reach external IPv6,
1. _IPv6-only_ users can reach external **IPv4** addresses, a neat trick.
#### IPv4 and IPv6 NAT
Let me start off with the basic table stakes. You'll likely be familiar with _masquerading_, a
NAT technique in Linux that uses the public IPv4 address assigned by your provider, allowing
many internal clients, often using [[RFC1918](https://datatracker.ietf.org/doc/html/rfc1918)] addresses,
to access the internet via that shared IPv4 address. You may not have come across IPv6 _masquerading_
though, but it's equally possible to take an internal (private, non-routable)
IPv6 network and access the internet via a shared IPv6 address.
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:
| **Machine** | **IPv4 pool** | **IPv6 pool** |
|-------------|---------------|---------------|
| border0.chbtl0.net.ipng.ch | <span style='color:green;'>194.126.235.0/30</span> | <span style='color:blue;'>2001:678:d78::3:0:0/125</span> |
| border0.chrma0.net.ipng.ch | <span style='color:green;'>194.126.235.4/30</span> | <span style='color:blue;'>2001:678:d78::3:1:0/125</span> |
| border0.chplo0.net.ipng.ch | <span style='color:green;'>194.126.235.8/30</span> | <span style='color:blue;'>2001:678:d78::3:2:0/125</span> |
| border0.nlams0.net.ipng.ch | <span style='color:green;'>194.126.235.12/30</span> | <span style='color:blue;'>2001:678:d78::3:3:0/125</span> |
Linux iptables _masquerading_ will only work with the IP addresses assigned to the external
interface, so I will need to use a slightly different approach to be able to use these _pools_. In
case you're wondering -- IPng's internal network has grown to the size now that I cannot expose it
all behind a single IPv4 address; there will not be enough TCP/UDP ports. Luckily, NATing via a pool
is pretty easy using the _SNAT_ module:
```
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/rc.firewall.ipng-sl
# IPng Site Local: Enable stateful firewalling on IPv4/IPv6 forwarding
iptables -P FORWARD DROP
ip6tables -P FORWARD DROP
iptables -I FORWARD -i enp1s0f1 -m state --state NEW -s 198.19.0.0/16 -j ACCEPT
ip6tables -I FORWARD -i enp1s0f1 -m state --state NEW -s 2001:678:d78:500::/56 -j ACCEPT
iptables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
ip6tables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
# IPng Site Local: Enable NAT on external interface using NAT pools
iptables -t nat -I POSTROUTING -s 198.19.0.0/16 -o enp1s0f0 \
-j SNAT --to 194.126.235.4-194.126.235.7
ip6tables -t nat -I POSTROUTING -s 2001:678:d78:500::/56 -o enp1s0f0 \
-j SNAT --to 2001:678:d78::3:1:0-2001:678:d78::3:1:7
EOF
```
From the top -- I'll first make it the default for the kernel to refuse to _FORWARD_ any traffic that
is not explicitly accepted. I will only allow traffic that comes in via `enp1s0f1` (the internal
interface), only if it comes from the assigned IPv4 and IPv6 site local prefixes. On the way back,
I'll allow traffic that matches states created on the way out. This is the _firewalling_ portion of
the setup.
Then, two _POSTROUTING_ rules turn on network address translation. If the source address is any of
the site local prefixes, I'll rewrite it to come from the IPv4 or IPv6 pool addresses, respectively.
This is the _NAT44_ and _NAT66_ portion of the setup.
#### NAT64: Jool
{{< image width="400px" float="right" src="/assets/nat64/jool.png" alt="Jool" >}}
So far, so good. But this article is about NAT64 :-) Here's where I grossly overestimated how
difficult it might be -- and if there's one takeaway from my story here, it should be that NAT64 is
as straightforward as the others! Enter [[Jool](https://jool.mx)], an Open Source SIIT and NAT64
for Linux. It's available in Debian as a DKMS kernel module and userspace tool, and it integrates
cleanly with both _iptables_ and _netfilter_.
Jool is a network address and port translating
implementation, which is referred to as _NAPT_, just like regular IPv4 NAT. When internal IPv6 clients
try to reach an external endpoint, Jool will make note of the internal src6:port, then select an
external IPv4 address:port, rewrite the packet, and on the way back, correlate the src4:port with
the internal src6:port, and rewrite the packet. If this sounds an awful lot like NAT, then you're
not wrong! The only difference is, Jool will also translate the *address family*: it will rewrite
the internal IPv6 addresses to external IPv4 addresses.
Installing Jool is as simple as this:
```
pim@border0-chrma0:~$ sudo apt install jool-dkms jool-tools
pim@border0-chrma0:~$ sudo mkdir /etc/jool
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/jool/jool.conf
{
"comment": {
"description": "Full NAT64 configuration for border0.chrma0.net.ipng.ch",
"last update": "2024-05-21"
},
"instance": "default",
"framework": "netfilter",
"global": { "pool6": "2001:678:d78:564::/96", "lowest-ipv6-mtu": 1280, "logging-debug": false },
"pool4": [
{ "protocol": "TCP", "prefix": "194.126.235.4/30", "port range": "1024-65535" },
{ "protocol": "UDP", "prefix": "194.126.235.4/30", "port range": "1024-65535" },
{ "protocol": "ICMP", "prefix": "194.126.235.4/30" }
]
}
EOF
pim@border0-chrma0:~$ sudo systemctl start jool
```
.. and that, as they say, is all there is to it! There are two things I'll make note of here:
1. I have assigned **2001:678:d78:564::/96** as NAT64 `pool6`, which means that if this machine
sees any traffic _destined_ to that prefix, it'll activate Jool, select an available IPv4
address:port from the `pool4`, and send the packet to the IPv4 destination address which it
takes from the last 32 bits of the original IPv6 destination address.
1. Cool trick: I am **reusing** the same IPv4 pool as for regular NAT. The Jool kernel module
happily coexists with the _iptables_ implementation!
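The address arithmetic in the first point is easy to verify with Python's `ipaddress` module: the embedded IPv4 destination is simply the low 32 bits of the IPv6 destination.

```python
import ipaddress

pool6 = ipaddress.ip_network("2001:678:d78:564::/96")
dst6 = ipaddress.ip_address("2001:678:d78:564::8c52:7903")

# Jool only activates for traffic destined to the NAT64 prefix.
assert dst6 in pool6

# The IPv4 destination is encoded in the low 32 bits of the IPv6 address.
dst4 = ipaddress.ip_address(int(dst6) & 0xFFFF_FFFF)
assert str(dst4) == "140.82.121.3"
```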
#### DNS64: Unbound
There's one vital piece of information missing, and it took me a little while to appreciate that. If
I take an IPv6 only host, like Summer, and I try to connect to an IPv4-only host, how does that even
work?
```
pim@summer:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
eno1 UP 2001:678:d78:50b::f/64 fe80::7e4d:8fff:fe03:3c00/64
pim@summer:~$ ip -6 ro
2001:678:d78:50b::/64 dev eno1 proto kernel metric 256 pref medium
fe80::/64 dev eno1 proto kernel metric 256 pref medium
default via 2001:678:d78:50b::1 dev eno1 proto static metric 1024 pref medium
pim@summer:~$ host github.com
github.com has address 140.82.121.4
pim@summer:~$ ping github.com
ping: connect: Network is unreachable
```
Now comes the really clever reveal -- NAT64 works by assigning an IPv6 prefix that snugly fits the
entire IPv4 address space, typically **64:ff9b::/96**, but operators can choose any prefix they'd like.
For IPng's site local network, I decided to assign **2001:678:d78:564::/96** for this purpose
(this is the `global.pool6` attribute in Jool's config file I described above). A resolver can then
tweak DNS lookups for IPv6-only hosts to return addresses from that IPv6 range. This tweaking is
called DNS64, described in [[RFC6147](https://datatracker.ietf.org/doc/html/rfc6147)]:
> DNS64 is a mechanism for synthesizing AAAA records from A records. DNS64 is used with an
> IPv6/IPv4 translator to enable client-server communication between an IPv6-only client and an
> IPv4-only server, without requiring any changes to either the IPv6 or the IPv4 node, for the
> class of applications that work through NATs.
I run the popular [[Unbound](https://www.nlnetlabs.nl/projects/unbound/about/)] resolver at IPng,
deployed as a set of anycasted instances across the network. With two lines of configuration only, I
can turn on this feature:
```
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/unbound/unbound.conf.d/dns64.conf
server:
module-config: "dns64 iterator"
dns64-prefix: 2001:678:d78:564::/96
EOF
pim@border0-chrma0:~$ sudo systemctl restart unbound
```
The behavior of the resolver now changes in a very subtle but cool way:
```
pim@summer:~$ host github.com
github.com has address 140.82.121.3
github.com has IPv6 address 2001:678:d78:564::8c52:7903
pim@summer:~$ host 2001:678:d78:564::8c52:7903
3.0.9.7.2.5.c.8.0.0.0.0.0.0.0.0.4.6.5.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa
domain name pointer lb-140-82-121-3-fra.github.com.
```
Before, [[github.com](https://github.com/pimvanpelt/)] did not return an AAAA record, so there was
no way for Summer to connect to it. But now, not only does the resolver return a synthesized AAAA
record, it also rewrites PTR requests: seeing that I'm asking about something in the DNS64 range of
**2001:678:d78:564::/96**, Unbound strips off the last 32 bits (`8c52:7903`, the hex encoding of the
original IPv4 address) and answers with the PTR lookup for the original `3.121.82.140.in-addr.arpa`
instead. Game changer!
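The synthesis and PTR rewriting that Unbound performs can be reproduced in a few lines of Python (a sketch of the address mapping for a /96 prefix, not Unbound's actual code):

```python
import ipaddress

NAT64 = ipaddress.ip_network("2001:678:d78:564::/96")

def synthesize_aaaa(a_record: str) -> str:
    """DNS64: embed an IPv4 A record into the NAT64 /96 prefix."""
    v4 = int(ipaddress.IPv4Address(a_record))
    return str(ipaddress.IPv6Address(int(NAT64.network_address) | v4))

def reverse_ptr(aaaa: str) -> str:
    """PTR rewriting: recover the original in-addr.arpa name."""
    v4 = ipaddress.ip_address(int(ipaddress.IPv6Address(aaaa)) & 0xFFFF_FFFF)
    return v4.reverse_pointer

print(synthesize_aaaa("140.82.121.3"))  # 2001:678:d78:564::8c52:7903
print(reverse_ptr("2001:678:d78:564::8c52:7903"))  # 3.121.82.140.in-addr.arpa
```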
{{< image width="400px" float="right" src="/assets/nat64/IPng NAT64.svg" alt="IPng Border Gateway" >}}
#### DNS64 + NAT64
What I learned from this is that the _combination_ of these two tools provides the magic:
1. When an IPv6-only client asks for AAAA for an IPv4-only hostname, Unbound will synthesize an AAAA
from the IPv4 address, casting it into the last 32 bits of its NAT64 prefix **2001:678:d78:564::/96**
1. When an IPv6-only client tries to send traffic to **2001:678:d78:564::/96**, Jool will do the
address family (and address/port) translation. This is represented by the red (ipv6) flow in the
diagram to the right turning into a green (ipv4) flow to the left.
What's left for me to do is to ensure that (a) the NAT64 prefix is routed from IPng Site Local to
the gateways and (b) that the IPv4 and IPv6 NAT address pools are routed from the Internet to the
gateways.
#### Internal: OSPF
I use Bird2 to accomplish the dynamic routing - and considering the Centec switch network is by
design _BGP Free_, I will use OSPF and OSPFv3 for these announcements. Using OSPF has an important
benefit: I can selectively turn on and off the Bird announcements to the Centec IPng Site local
network. Seeing as there will be multiple redundant gateways, if one of them goes down (either due
to failure or because of maintenance), the network will quickly reconverge on another replica. Neat!
Here's how I configure the OSPF import and export filters:
```
filter ospf_import {
if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept;
if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept;
reject;
}
filter ospf_export {
if (net.type=NET_IP4 && !(net~[198.19.0.255/32,0.0.0.0/0])) then reject;
if (net.type=NET_IP6 && !(net~[2001:678:d78:564::/96,2001:678:d78:500::1:0/128,::/0])) then reject;
ospf_metric1 = 200; unset(ospf_metric2);
accept;
}
```
When learning prefixes _from_ the Centec switch, I will only accept precisely the IPng Site Local
IPv4 (198.19.0.0/16) and IPv6 (2001:678:d78:500::/56) supernets. On sending prefixes _to_ the Centec
switches, I will announce:
* ***198.19.0.255/32*** and ***2001:678:d78:500::1:0/128***: These are the anycast addresses of the Unbound resolver.
* ***0.0.0.0/0*** and ***::/0***: These are default routes for IPv4 and IPv6 respectively
* ***2001:678:d78:564::/96***: This is the NAT64 prefix, which will attract the IPv6-only traffic
towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation
of github.com, which is reachable only at legacy address 140.82.121.3.
{{< image width="100px" float="left" src="/assets/nat64/brain.png" alt="Brain" >}}
I have to be careful with the announcements into OSPF. The cost of an E1 route is the external
metric **in addition to** the internal OSPF cost to reach the announcing router, while the cost of
an E2 route is always just the external metric, taking no notice of the internal cost. Therefore, I
emit these prefixes without Bird's `ospf_metric2` set, which makes them E1, so that the closest
border gateway is always used.
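A quick sketch with made-up internal costs shows why E1 is the right choice here:

```python
# OSPF external route cost, illustrated with made-up numbers: the border
# gateways announce prefixes with an external metric of 200 (ospf_metric1).
EXTERNAL_METRIC = 200

def e1_cost(internal_cost: int) -> int:
    # Type-1 external: the internal path cost is added to the external
    # metric, so the nearest gateway announcing the prefix wins.
    return internal_cost + EXTERNAL_METRIC

def e2_cost(internal_cost: int) -> int:
    # Type-2 external: the internal cost is ignored entirely.
    return EXTERNAL_METRIC

# A host with internal cost 100 to gateway A and cost 20 to gateway B:
assert e1_cost(20) < e1_cost(100)    # E1 prefers the closer gateway B
assert e2_cost(20) == e2_cost(100)   # E2 cannot tell them apart
```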
With that, I can see the following:
```
pim@summer:~$ traceroute6 github.com
traceroute to github.com (2001:678:d78:564::8c52:7903), 30 hops max, 80 byte packets
1 msw0.chbtl0.net.ipng.ch (2001:678:d78:50b::1) 4.134 ms 4.640 ms 4.796 ms
2 border0.chbtl0.net.ipng.ch (2001:678:d78:503::13) 0.751 ms 0.818 ms 0.688 ms
3 * * *
4 * * * ^C
```
I'm not quite there yet; I have one more step to go. What's happening at the Border Gateway? Let me
take a look, while I run a ping6 to github.com:
```
pim@summer:~$ ping6 github.com
PING github.com(lb-140-82-121-4-fra.github.com (2001:678:d78:564::8c52:7904)) 56 data bytes
... (nothing)
pim@border0-chbtl0:~$ sudo tcpdump -ni any src host 2001:678:d78:50b::f or dst host 140.82.121.4
11:25:19.225509 enp1s0f1 In IP6 2001:678:d78:50b::f > 2001:678:d78:564::8c52:7904:
ICMP6, echo request, id 3904, seq 7, length 64
11:25:19.225603 enp1s0f0 Out IP 194.126.235.3 > 140.82.121.4:
ICMP echo request, id 61668, seq 7, length 64
```
Unbound and Jool are doing great work. Unbound saw my DNS request for IPv4-only github.com, and
synthesized a DNS64 response for me. Jool then saw the inbound packet from enp1s0f1, the internal
interface pointed at IPng Site Local. This is because the **2001:678:d78:564::/96** prefix is
announced in OSPFv3 so every host knows to route traffic to that prefix to this border gateway.
But then, I see the NAT64 in action on the outbound interface enp1s0f0. Here, one of the IPv4 pool
addresses is selected as source address. But there is no return packet, because there is no route
back from the Internet, yet.
#### External: BGP
The final step for me is to allow return traffic, from the Internet to the IPv4 and IPv6 pools to
reach this Border Gateway instance. For this, I configure BGP with the following Bird2
configuration snippet:
```
filter bgp_import {
if (net.type = NET_IP4 && !(net = 0.0.0.0/0)) then reject;
if (net.type = NET_IP6 && !(net = ::/0)) then reject;
accept;
}
filter bgp_export {
if (net.type = NET_IP4 && !(net ~ [ 194.126.235.4/30 ])) then reject;
if (net.type = NET_IP6 && !(net ~ [ 2001:678:d78::3:1:0/125 ])) then reject;
# Add BGP Wellknown community no-export (FFFF:FF01)
bgp_community.add((65535,65281));
accept;
}
```
I then establish an eBGP session from private AS64513 to two of IPng Networks' core routers at
AS8298. I add the wellknown BGP no-export community (`FFFF:FF01`) so that these prefixes are learned
in AS8298, but never propagated. It's not strictly necessary, because AS8298 won't announce more
specifics like these anyway, but it's a nice way to really assert that these are meant to stay
local. Because AS8298 is already announcing **194.126.235.0/24** and **2001:678:d78::/48**
supernets, return traffic will already be able to reach IPng's routers upstream. With these more
specific announcements of the /30 and /125 pools, the upstream VPP routers will be able to route the
return traffic to this specific server.
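As an aside, the `(65535,65281)` pair in the Bird config and the `FFFF:FF01` notation are the same 32-bit well-known community value, just written differently:

```python
# A standard BGP community is one 32-bit value, conventionally written as
# two 16-bit halves "high:low". The well-known no-export community is
# 0xFFFFFF01 (RFC 1997), i.e. (65535, 65281) in Bird's notation.
high, low = 65535, 65281
value = (high << 16) | low

assert value == 0xFFFFFF01
assert f"{high:04X}:{low:04X}" == "FFFF:FF01"
```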
And with that, the ping to Unbound's DNS64 provided IPv6 address for github.com shoots to life.
### Results
I deployed four of these Border Gateways using Ansible: one at my office in Br&uuml;ttisellen, one
in Zurich, one in Geneva and one in Amsterdam. They do all three types of NAT:
* Announcing the IPv4 default **0.0.0.0/0** will allow them to serve as NAT44 gateways for
**198.19.0.0/16**
* Announcing the IPv6 default **::/0** will allow them to serve as NAT66 gateway for
**2001:678:d78:500::/56**
* Announcing the IPv6 nat64 prefix **2001:678:d78:564::/96** will allow them to serve as NAT64 gateway
* Announcing the IPv4 and IPv6 anycast address for `nscache.net.ipng.ch` allows them to serve DNS64
Each individual service can be turned on or off. For example, withdrawing the IPv4 default
announcement into the Centec network stops attracting NAT44 traffic through a replica. Similarly,
withdrawing the NAT64 prefix announcement stops attracting NAT64 traffic through that replica. OSPF
in the IPng Site Local network will automatically select an alternative replica in such cases.
Shutting down Bird2 altogether immediately drains the machine of all traffic, which is rerouted to
the remaining replicas.
If you're curious, here's a few minutes of me playing with failover, while watching YouTube videos
concurrently.
{{< image src="/assets/nat64/nat64.gif" alt="Asciinema" >}}
### What's Next
I've added an Ansible module in which I can configure the individual instances' IPv4 and IPv6 NAT
pools, and turn on/off the three NAT types by means of steering the OSPF announcements. I can also
turn on/off the Anycast Unbound announcements, in much the same way.
If you're a regular reader of my stories, you'll maybe be asking: Why didn't you use VPP? And that
would be an excellent question. I need to noodle a little bit more with respect to having all three
NAT types concurrently working alongside Linux CP for the Bird and Unbound stuff, but I think in the
future you might see a followup article on how to do all of this in VPP. Stay tuned!
---
date: "2024-06-22T09:17:54Z"
title: VPP with loopback-only OSPFv3 - Part 2
---
{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}}
# Introduction
When I first built IPng Networks AS8298, I decided to use OSPF as an IPv4 and IPv6 internal gateway
protocol. Back in March I took a look at two slightly different ways of doing this for IPng, notably
against a backdrop of conserving IPv4 addresses. As the network grows, the little point to point
transit networks between routers really start adding up.
I explored two potential solutions to this problem:
1. **[[Babel]({% post_url 2024-03-06-vpp-babel-1 %})]** can use IPv6 nexthops for IPv4 destinations -
which is _super_ useful because it would allow me to retire all of the IPv4 /31 point to point
networks between my routers.
1. **[[OSPFv3]({% post_url 2024-04-06-vpp-ospf %})]** makes it difficult to use IPv6 nexthops for
IPv4 destinations, but in a discussion with the Bird Users mailinglist, we found a way: by reusing
a single IPv4 loopback address on adjacent interfaces
{{< image width="90px" float="left" src="/assets/vpp-ospf/canary.png" alt="Canary" >}}
In May I ran a modest set of two _canaries_, one between the two routers in my house (`chbtl0` and
`chbtl1`), and another between a router at the Daedalean colocation and Interxion datacenters (`ddln0`
and `chgtg0`). AS8298 has about a quarter of a /24 tied up in these otherwise pointless point-to-point
transit networks (see what I did there?). I want to reclaim these!
Seeing as the two tests went well, I decided to roll this out and make it official. This post
describes how I rolled out an (almost) IPv4-less core network for IPng Networks. It was actually way
easier than I had anticipated, and apparently I was not alone - several of my buddies in the
industry have asked me about it, so I thought I'd write a little bit about the configuration.
# Background: OSPFv3 with IPv4
***💩 /30: 4 addresses***: In the oldest of days, two routers that formed an IPv4 OSPF adjacency would
have a /30 _point-to-point_ transit network between them. Router A would have the lower available
IPv4 address, and Router B would have the upper available IPv4 address. The other two addresses in
the /30 would be the _network_ and _broadcast_ addresses of the prefix. Not a very efficient way to
do things, but back in the old days, IPv4 addresses were in infinite supply.
***🥈 /31: 2 addresses***: Enter [[RFC3021](https://datatracker.ietf.org/doc/html/rfc3021)], from
December 2000, which some might argue is also the old days. With ever-increasing pressure to
conserve IP address space on the Internet, it makes sense to consider where relatively minor changes
can be made to fielded practice to improve numbering efficiency. This RFC describes how to halve the
amount of address space assigned to point-to-point links (common throughout the Internet
infrastructure) by allowing the use of /31 prefixes for them. At some point, even our friends from
Latvia figured it out!
***🥇 /32: 1 address***: In most networks, each router has what is called a _loopback_ IPv4 and IPv6
address, typically a /32 and /128 in size. This allows the router to select a unique address that is
not bound to any given interface. It comes in handy in many ways -- for example to have stable
addresses to manage the router, and to allow it to connect to iBGP route reflectors and peers from
well known addresses.
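The bookkeeping above is easy to double-check with Python's `ipaddress` module:

```python
import ipaddress

# /30: four addresses, of which only two are usable by the routers.
assert ipaddress.ip_network("192.0.2.0/30").num_addresses == 4

# /31 (RFC 3021): both addresses are usable; nothing is lost to the
# network and broadcast addresses.
assert ipaddress.ip_network("192.0.2.0/31").num_addresses == 2

# /32: a single loopback address per router.
assert ipaddress.ip_network("192.0.2.0/32").num_addresses == 1
```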
As it so turns out, two routers that form an adjacency can advertise ~any IPv4 address as nexthop,
provided that their adjacent peer knows how to find that address. Of course, with a /30 or /31 this
is obvious: if I have a directly connected /31, I can simply ARP for the other side, learn its MAC
address, and use that to forward traffic to the other router.
### The Trick
What would it look like if there's no subnet that directly connects two adjacent routers? Well, I
happen to know that RouterA and RouterB both have a /32 loopback address. So if I simply let RouterA
(1) advertise _its loopback_ address to neighbor RouterB, and also (2) answer ARP requests for that
address, the two routers should be able to form an adjacency. This is exactly what Ondrej's [[Bird2
commit (1)](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)] and
my [[VPP gerrit (2)](https://gerrit.fd.io/r/c/vpp/+/40482)] accomplish, as perfect partners:
1. Ondrej's change will make the Link LSA be _onlink_, which is a way to describe that the next hop
is not directly connected, in other words RouterB will be at nexthop `192.0.2.1`, while
RouterA itself is `192.0.2.0/32`.
1. My change will make VPP answer for ARP requests in such a scenario where RouterA with an
_unnumbered_ interface with `192.0.2.0/32` will respond to a request from the not directly
connected _onlink_ peer RouterB at `192.0.2.1`.
## Rolling out P2P-less OSPFv3
### 1. Upgrade VPP + Bird2
First order of business is to upgrade all routers. I need a VPP version with the [[ARP
gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] and a Bird2 version with the [[OSPFv3
commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)]. I build
a set of Debian packages on `bookworm-builder` and upload them to IPng's website
[[ref](https://ipng.ch/media/vpp/bookworm/)].
I schedule two nightly maintenance windows. In the first one, I'll upgrade two routers (`frggh0`
and `ddln1`) as canaries. I'll let them run for a few days, and then roll over the rest once I'm
confident there are no regressions.
For each router, I will first _drain it_: this means in Kees, setting the OSPFv2 and OSPFv3 cost of
routers neighboring it to a higher number, so that traffic flows around the 'expensive' link. I will
also move the eBGP sessions into _shutdown_ mode, which will make the BGP sessions stay connected,
but the router will not announce any prefixes nor accept any from peers. Without it announcing or
learning any prefixes, the router stops seeing traffic. After about 10 minutes, it is safe to make
intrusive changes to it.
Seeing as I'll be moving from OSPFv2 to OSPFv3, I will allow for a seamless transition by
configuring both protocols to run at the same time. The filter that applies to both flavors of OSPF
is the same: I will only allow more specifics of IPng's own prefixes to be propagated, and in
particular I'll drop all prefixes that come from BGP. I'll rename the protocol called `ospf4` to
`ospf4_old`, and create a new (OSPFv3) protocol called `ospf4` which has only the loopback interface
in it. This way, when I'm done, the final running protocol will simply be called `ospf4`:
```
filter f_ospf {
if (source = RTS_BGP) then reject;
if (net ~ [ 92.119.38.0/24{25,32}, 194.1.163.0/24{25,32}, 194.126.235.0/24{25,32} ]) then accept;
if (net ~ [ 2001:678:d78::/48{56,128}, 2a0b:dd80:3000::/36{48,48} ]) then accept;
reject;
}
protocol ospf v2 ospf4_old {
ipv4 { export filter f_ospf; import filter f_ospf; };
area 0 {
interface "loop0" { stub yes; };
interface "xe1-1.302" { type pointopoint; cost 61; bfd on; };
interface "xe1-0.304" { type pointopoint; cost 56; bfd on; };
};
}
protocol ospf v3 ospf4 {
ipv4 { export filter f_ospf; import filter f_ospf; };
area 0 {
interface "loop0","lo" { stub yes; };
};
}
```
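Bird's `net ~ [ prefix{lo,hi} ]` syntax matches any prefix that falls within the given supernet and whose length is between `lo` and `hi`. A rough Python equivalent of the IPv4 half of `f_ospf` (my own sketch of the prefix-range match only, not Bird code; the BGP-source check is omitted):

```python
import ipaddress

# The three (supernet, minlen, maxlen) ranges from f_ospf's IPv4 line.
RANGES = [
    ("92.119.38.0/24", 25, 32),
    ("194.1.163.0/24", 25, 32),
    ("194.126.235.0/24", 25, 32),
]

def f_ospf_accepts(prefix: str) -> bool:
    """Mimic Bird's `net ~ [ supernet{lo,hi} ]` match for IPv4."""
    net = ipaddress.ip_network(prefix)
    for supernet, lo, hi in RANGES:
        sup = ipaddress.ip_network(supernet)
        if net.subnet_of(sup) and lo <= net.prefixlen <= hi:
            return True
    return False

assert f_ospf_accepts("194.1.163.7/32")      # a router loopback: accepted
assert not f_ospf_accepts("194.1.163.0/24")  # the /24 itself is too short
assert not f_ospf_accepts("8.8.8.0/25")      # not IPng space: rejected
```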
In one terminal, I will start a ping to the router's IPv4 loopback:
```
pim@summer:~$ ping defra0.ipng.ch
PING (194.1.163.7) 56(84) bytes of data.
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms
...
```
While in the other, I log in to the IPng Site Local connection to the router's management plane, to
perform the upgrade:
```
pim@squanchy:~$ ssh defra0.net.ipng.ch
pim@defra0:~$ wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/
pim@defra0:~$ cd ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/
pim@defra0:~$ sudo nsenter --net=/var/run/netns/dataplane
root@defra0:~# pkill -9 vpp && systemctl stop bird-dataplane vpp && \
dpkg -i ~pim/ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/*.deb && \
dpkg -i ~pim/bird2_2.15.1_amd64.deb && \
systemctl start bird-dataplane && \
systemctl restart vpp-snmp-agent-dataplane vpp-exporter-dataplane
```
Then comes the small window of awkward staring at the ping I started in the other terminal. It
always makes me smile because it all comes back very quickly: within 90 seconds the router is back
online and fully converged with BGP:
```
pim@summer:~$ ping defra0.ipng.ch
PING (194.1.163.7) 56(84) bytes of data.
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms
...
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=94 ttl=61 time=1003.83 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=95 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=96 ttl=61 time=7.02 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=97 ttl=61 time=7.03 ms
pim@defra0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4_old:
Router ID Pri State DTime Interface Router IP
194.1.163.8 1 Full/PtP 32.113 xe1-1.302 194.1.163.27
194.1.163.0 1 Full/PtP 30.936 xe1-0.304 194.1.163.24
ospf4:
Router ID Pri State DTime Interface Router IP
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.8 1 Full/PtP 32.113 xe1-1.302 fe80::3eec:efff:fe46:68a8
194.1.163.0 1 Full/PtP 30.936 xe1-0.304 fe80::6a05:caff:fe32:4616
```
I can see that the OSPFv2 adjacencies have reformed, which is totally expected. Looking at the
router's current addresses:
```
pim@defra0:~$ ip -br a | grep UP
loop0 UP 194.1.163.7/32 2001:678:d78::7/128 fe80::dcad:ff:fe00:0/64
xe1-0 UP fe80::6a05:caff:fe32:3e48/64
xe1-1 UP fe80::6a05:caff:fe32:3e49/64
xe1-2 UP fe80::6a05:caff:fe32:3e4a/64
xe1-3 UP fe80::6a05:caff:fe32:3e4b/64
xe1-0.304@xe1-0 UP 194.1.163.25/31 2001:678:d78::2:7:2/112 fe80::6a05:caff:fe32:3e48/64
xe1-1.302@xe1-1 UP 194.1.163.26/31 2001:678:d78::2:8:1/112 fe80::6a05:caff:fe32:3e49/64
xe1-2.441@xe1-2 UP 46.20.246.51/29 2a02:2528:ff01::3/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.503@xe1-2 UP 80.81.197.38/21 2001:7f8::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.514@xe1-2 UP 185.1.210.235/23 2001:7f8:3d::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.515@xe1-2 UP 185.1.208.84/23 2001:7f8:44::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.516@xe1-2 UP 185.1.171.43/23 2001:7f8:9e::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-3.900@xe1-3 UP 193.189.83.55/23 2001:7f8:33::a100:8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.2003@xe1-3 UP 185.1.155.116/24 2a0c:b641:701::8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.3145@xe1-3 UP 185.1.167.136/23 2001:7f8:f2:e1::8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.1405@xe1-3 UP 80.77.16.214/30 2a00:f820:839::2/64 fe80::6a05:caff:fe32:3e4b/64
```
Take a look at interfaces `xe1-0.304` which is southbound from Frankfurt to Zurich
(`chrma0.ipng.ch`) and `xe1-1.302` which is northbound from Frankfurt to Amsterdam
(`nlams0.ipng.ch`). I am going to get rid of the IPv4 and IPv6 global unicast addresses on these two
interfaces, and let OSPFv3 borrow the IPv4 address from `loop0` instead.
But first, rinse and repeat, until all routers are upgraded.
### 2. A situational overview
First, let me draw a diagram that helps show what I'm about to do:
{{< image src="/assets/vpp-ospf/BTL-GTG-RMA Before.svg" alt="Step 2: Before" >}}
In the network overview I've drawn four of IPng's routers. The ones at the bottom are the two
routers at my office in Br&uuml;ttisellen, Switzerland, which explains their name `chbtl0` and
`chbtl1`, and they are connected via a local fiber trunk using 10Gig optics (drawn in <span
style='color:red;font-weight:bold;'>red</span>). On the left, the first router is connected via a
10G Ethernet-over-MPLS link (depicted in <span style='color:green;font-weight:bold;'>green</span>)
to the NTT Datacenter in R&uuml;mlang. From there, IPng rents a 25Gbps wavelength to the Interxion
datacenter in Glattbrugg (shown in <span style='color:blue;font-weight:bold;'>blue</span>). Finally,
the Interxion router connects back to Br&uuml;ttisellen using a 10G Ethernet-over-MPLS link (colored
in <span style='color:#ff00ff;font-weight:bold;'>pink</span>), completing the ring.
You can also see that each router has a set of _loopback_ addresses, for example `chbtl0` in the
bottom left has IPv4 address `194.1.163.3/32` and IPv6 address `2001:678:d78::3/128`. Each
point-to-point network is assigned one /31 and one /112, with each router taking one address on
either side.
Counting them up real quick, I see **twelve IPv4 addresses** in this diagram. This is a classic OSPF
design pattern. I seek to save eight of these addresses!
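The count, spelled out:

```python
# Four routers, four point-to-point links (red, green, blue, pink).
routers, links = 4, 4
loopbacks = routers * 1      # one /32 loopback per router
p2p = links * 2              # two addresses per /31 transit network

assert loopbacks + p2p == 12  # the twelve addresses in the diagram
assert p2p == 8               # the eight reclaimed by going unnumbered
```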
### 3. First OSPFv3 link
The rollout has to start somewhere, and I decide to start close to home, literally. I'm going to
remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:bold;'>red</span> link between the two
routers in Br&uuml;ttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!
I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.
Looking at their `vppcfg` files, the change is actually very easy, taking as an example the
configuration file for `chbtl0.ipng.ch`:
```
loopbacks:
loop0:
description: 'Core: chbtl1.ipng.ch'
addresses: ['194.1.163.3/32', '2001:678:d78::3/128']
lcp: loop0
mtu: 9000
interfaces:
TenGigabitEthernet6/0/0:
device-type: dpdk
description: 'Core: chbtl1.ipng.ch'
mtu: 9000
lcp: xe1-0
# addresses: [ '194.1.163.20/31', '2001:678:d78::2:5:1/112' ]
unnumbered: loop0
```
By commenting out the `addresses` field, and replacing it with `unnumbered: loop0`, I instruct
vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback
interface `loop0`.
{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
Planning and applying this is straightforward, but there's one detail I should
mention. In my [[previous article]({% post_url 2024-04-06-vpp-ospf %})] I asked myself a question:
would it be better to leave the addresses unconfigured in Linux, or would it be better to make the
Linux Control Plane plugin carry forward the borrowed addresses? In the end, I decided to _not_ copy
them forward. VPP will be aware of the addresses, but Linux will only carry them on the `loop0`
interface.
In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).
```
pim@chbtl0:~$ vppcfg plan -c /etc/vpp/vppcfg.yaml
[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 24.06-rc0~183-gb0d433978
comment { vppcfg prune: 2 CLI statement(s) follow }
set interface ip address del TenGigabitEthernet6/0/0 194.1.163.20/31
set interface ip address del TenGigabitEthernet6/0/0 2001:678:d78::2:5:1/112
comment { vppcfg sync: 1 CLI statement(s) follow }
set interface unnumbered TenGigabitEthernet6/0/0 use loop0
[INFO ] vppcfg.reconciler.write: Wrote 5 lines to (stdout)
[INFO ] root.main: Planning succeeded
pim@chbtl0:~$ vppcfg show int addr TenGigabitEthernet6/0/0
TenGigabitEthernet6/0/0 (up):
unnumbered, use loop0
L3 194.1.163.3/32
L3 2001:678:d78::3/128
pim@chbtl0:~$ vppctl show lcp | grep TenGigabitEthernet6/0/0
itf-pair: [9] TenGigabitEthernet6/0/0 tap9 xe1-0 65 type tap netns dataplane
pim@chbtl0:~$ ip -br a | grep UP
xe0-0 UP fe80::92e2:baff:fe3f:cad4/64
xe0-1 UP fe80::92e2:baff:fe3f:cad5/64
xe0-1.400@xe0-1 UP fe80::92e2:baff:fe3f:cad4/64
xe0-1.400.10@xe0-1.400 UP 194.1.163.16/31 2001:678:d78::2:3:1/112 fe80::92e2:baff:fe3f:cad4/64
xe1-0 UP fe80::21b:21ff:fe55:1dbc/64
xe1-1.101@xe1-1 UP 194.1.163.65/27 2001:678:d78:3::1/64 fe80::14b4:c6ff:fe1e:68a3/64
xe1-1.179@xe1-1 UP 45.129.224.236/29 2a0e:5040:0:2::236/64 fe80::92e2:baff:fe3f:cad5/64
```
After applying this configuration, I can see that Te6/0/0 indeed is _unnumbered, use loop0_ noting
the IPv4 and IPv6 addresses that it borrowed. I can see with the second command that Te6/0/0
corresponds in Linux with `xe1-0`, and finally with the third command I can list the addresses of
the Linux view, and indeed I confirm that `xe1-0` only has a link local address. Slick!
After applying this change, the OSPFv2 adjacency in the `ospf4_old` protocol expires, and I see the
routing table converge. A traceroute between `chbtl0` and `chbtl1` now takes a bit of a detour:
```
pim@chbtl0:~$ traceroute chbtl1.ipng.ch
traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets
1 chrma0.ipng.ch (194.1.163.17) 0.981 ms 0.969 ms 0.953 ms
2 chgtg0.ipng.ch (194.1.163.9) 1.194 ms 1.192 ms 1.176 ms
3 chbtl1.ipng.ch (194.1.163.4) 1.875 ms 1.866 ms 1.911 ms
```
I can now introduce the very first OSPFv3 adjacency for IPv4, and I do this by moving the neighbor
from the `ospf4_old` protocol to the `ospf4` protocol. Of course, I also update chbtl1 with the
_unnumbered_ interface on its `xe1-0`, and update OSPF there. And with that, something magical
happens:
```
pim@chbtl0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4_old:
Router ID Pri State DTime Interface Router IP
194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c
ospf4:
Router ID Pri State DTime Interface Router IP
194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18
194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c
pim@chbtl0:~$ birdc show route protocol ospf4
BIRD v2.15.1-4-g280daed5-x ready.
Table master4:
194.1.163.4/32 unicast [ospf4 2024-05-19 20:58:04] * I (150/2) [194.1.163.4]
via 194.1.163.4 on xe1-0 onlink
194.1.163.64/27 unicast [ospf4 2024-05-19 20:58:04] E2 (150/2/10000) [194.1.163.4]
via 194.1.163.4 on xe1-0 onlink
```
Aww, would you look at that! Especially the first entry is interesting to me. It says that this
router has learned the address `194.1.163.4/32`, the loopback address of `chbtl1` via nexthop
**also** `194.1.163.4` on interface `xe1-0` with a flag _onlink_.
The kernel routing table agrees with this construction:
```
pim@chbtl0:~$ ip ro get 194.1.163.4
194.1.163.4 via 194.1.163.4 dev xe1-0 src 194.1.163.3 uid 1000
cache
```
Now, what this construction tells the kernel is that it should ARP for `194.1.163.4` using local
address `194.1.163.3`, for which VPP on the other side will respond, thanks to my [[VPP ARP
gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. As such, I should expect now a FIB entry for VPP:
```
pim@chbtl0:~$ vppctl show ip fib 194.1.163.4
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
194.1.163.4/32 fib:0 index:973099 locks:3
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[189] locks:98 flags:shared,popular, uPRF-list:507 len:1 itfs:[36, ]
path:[166] pl-index:189 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
194.1.163.4 TenGigabitEthernet6/0/0
[@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1
path-list:[1025] locks:1 uPRF-list:1521 len:1 itfs:[36, ]
path:[379] pl-index:1025 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
194.1.163.4 TenGigabitEthernet6/0/0
[@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
Extensions:
path:379
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:848961 buckets:1 uRPF:507 to:[1966944:611861009]]
[0] [@5]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
```
Nice work, VPP and Bird2! I confirm that I can ping the neighbor again, and that the traceroute is
direct rather than the scenic route from before, and I validate that IPv6 still works for good
measure:
```
pim@chbtl0:~$ ping -4 chbtl1.ipng.ch
PING 194.1.163.4 (194.1.163.4) 56(84) bytes of data.
64 bytes from 194.1.163.4: icmp_seq=1 ttl=63 time=0.169 ms
64 bytes from 194.1.163.4: icmp_seq=2 ttl=63 time=0.283 ms
64 bytes from 194.1.163.4: icmp_seq=3 ttl=63 time=0.232 ms
64 bytes from 194.1.163.4: icmp_seq=4 ttl=63 time=0.271 ms
^C
--- 194.1.163.4 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 0.163/0.233/0.276/0.045 ms
pim@chbtl0:~$ traceroute chbtl1.ipng.ch
traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.4) 0.190 ms 0.176 ms 0.147 ms
pim@chbtl0:~$ ping6 chbtl1.ipng.ch
PING chbtl1.ipng.ch(chbtl1.ipng.ch (2001:678:d78::4)) 56 data bytes
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=1 ttl=64 time=0.205 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=2 ttl=64 time=0.203 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=3 ttl=64 time=0.213 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=4 ttl=64 time=0.219 ms
^C
--- chbtl1.ipng.ch ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3068ms
rtt min/avg/max/mdev = 0.203/0.210/0.219/0.006 ms
pim@chbtl0:~$ traceroute6 chbtl1.ipng.ch
traceroute to chbtl1.ipng.ch (2001:678:d78::4), 30 hops max, 80 byte packets
1 chbtl1.ipng.ch (2001:678:d78::4) 0.163 ms 0.147 ms 0.124 ms
```
### 4. From one to two
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 4.svg" alt="Step 4: Canary" >}}
At this point I have two IPv4 IGPs running. This is not ideal, but it's also not completely broken,
because the OSPF filter allows the routers to learn and propagate any more specific prefix from
`194.1.163.0/24`. This way, the legacy OSPFv2 called `ospf4_old` and this new OSPFv3 called `ospf4`
will be aware of all routes. Bird will learn them twice, and routing decisions may be a bit funky
because the OSPF protocols learn the routes from each other as OSPF-E2. There are two implications
of this:
1. It means that the routes that are learned from the other OSPF protocol will have a fixed metric
(==cost), and for the time being, I won't be able to cleanly add up link costs between the
routers that are speaking OSPFv2 and those that are speaking OSPFv3.
1. If an OSPF External Type E1 and a Type E2 route exist to the same destination, the E1 route will
always be preferred irrespective of the metric. This means that within the routers that speak
OSPFv2, cost will remain consistent; and also within the routers that speak OSPFv3, it will be
consistent. Between them, routes will be learned, but cost will be roughly meaningless.
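The E1-over-E2 tie-break can be captured in a tiny comparison function. This is a sketch of the rule described above, not Bird's actual selection code; routes are modeled as simple (type, metric) tuples:

```python
# Sketch of the OSPF external-route preference described above: an E1
# route beats an E2 route regardless of metric; within the same type,
# the lower metric wins. Routes are modeled as (type, metric) tuples.
def best_external(routes):
    rank = {"E1": 0, "E2": 1}  # any E1 ranks ahead of any E2
    return min(routes, key=lambda r: (rank[r[0]], r[1]))

# An E1 route with a huge metric still beats a cheap E2 route:
print(best_external([("E2", 10), ("E1", 10000)]))
```

This is why cost stays meaningful *within* each island of the IGP, but not *between* them.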
I upgrade another link, between router `chgtg0` and `ddln0` at my [[colo]({% post_url
2022-02-24-colo %})], which is connected via a 10G EoMPLS link from a local telco called Solnet. The
colo, similar to IPng's office, has two redundant 10G uplinks, so if things were to fall apart, I
can always quickly shutdown the offending link (thereby removing OSPFv3 adjacencies), and traffic
will reroute. I have created two islands of OSPFv3, drawn in <span
style='color:orange;font-weight:bold'>orange</span>, with exactly two links using IPv4-less point to
point networks. I let this run for a few weeks, to make sure things do not fail in mysterious ways.
### 5. From two to many
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 5.svg" alt="Step 5: Zurich" >}}
From this point on it's just rinse-and-repeat. For each backbone link:
1. I will drain the backbone link I'm about to work on, by raising OSPFv2 and OSPFv3 cost on both
sides. If the cost was, say, 56, I will temporarily make that 1056. This will make traffic
avoid using the link if at all possible. Due to redundancy, every router has (at least) two
backbone links. Traffic will be diverted.
1. I first change the VPP router's `vppcfg.yaml` to remove the p2p addresses and replace them with
an `unnumbered: loop0` instead. I apply the diff, and the OSPF adjacency breaks for IPv4.
The BFD adjacency for IPv4 will disappear. Curiously, the IPv6 adjacency stays up, because
OSPFv3 adjacencies use link-local addresses.
1. I move the interface section of the old OSPFv2 `ospf4_old` protocol to the new OSPFv3
   `ospf4` protocol, which will also use link-local addresses to form adjacencies. The two routers
will exchange Link LSA and be able to find each other directly connected. Now the link is
running **two** OSPFv3 protocols, each in their own address family. They will share the same BFD
session.
1. I finally undrain the link by setting the OSPF link cost back to what it was. This link is now
a part of the OSPFv3 part of the network.
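Concretely, step 3 amounts to moving an interface stanza between protocols in `bird.conf`. Here's a sketch of what that looks like; the interface name, cost and filter name are illustrative, not IPng's actual configuration:

```
# Before: the link runs legacy OSPFv2 (illustrative names and cost)
protocol ospf v2 ospf4_old {
  ipv4 { export filter ospf_export; };
  area 0 {
    interface "xe1-2.100" { type pointopoint; cost 56; bfd on; };
  };
}

# After: the same stanza moved into the OSPFv3 protocol, which carries
# the IPv4 address family over link-local nexthops
protocol ospf v3 ospf4 {
  ipv4 { export filter ospf_export; };
  area 0 {
    interface "xe1-2.100" { type pointopoint; cost 56; bfd on; };
  };
}
```

A `birdc configure` then picks up the change without disturbing the other protocols.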
I work my way through the network. The first one I do is the link between `chgtg0` and `chbtl1`
(which I've colored in the diagram in <span style='color:#ff00ff;font-weight:bold'>pink</span>), so
that there are four contiguous OSPFv3 links, spanning from chbtl0 - chbtl1 - chgtg0 - ddln0. I
constantly run a traceroute to a machine that is directly connected behind ddln0, and also use
RIPE Atlas and the NLNOG Ring to ensure that I have reachability:
```
pim@squanchy:~$ traceroute ipng.mm.fcix.net
traceroute to ipng.mm.fcix.net (194.1.163.59), 64 hops max, 40 byte packets
1 chbtl0 (194.1.163.65) 0.279 ms 0.362 ms 0.249 ms
2 chbtl1 (194.1.163.3) 0.455 ms 0.394 ms 0.384 ms
3 chgtg0 (194.1.163.1) 1.302 ms 1.296 ms 1.294 ms
4 ddln0 (194.1.163.5) 2.232 ms 2.385 ms 2.322 ms
5 mm0.ddln0.ipng.ch (194.1.163.59) 2.377 ms 2.577 ms 2.364 ms
```
I work my way outwards from there. First completing the ring chbtl0 - chrma0 - chgtg0 - chbtl1, and
then completing the ring ddln0 - ddln1 - chrma0 - chgtg0, after which the Zurich metro area is
converted. I then work my way clockwise from Zurich to Geneva, Paris, Lille, Amsterdam, Frankfurt,
and end up with the last link completing the set: defra0 - chrma0.
## Results
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout After.svg" alt="OSPFv3: After" >}}
In total I reconfigure thirteen backbone links, and they all become _unnumbered_ using the router's
loopback addresses for IPv4 and IPv6, and they all switch over from their OSPFv2 IGP to the new
OSPFv3 IGP; the total number of routers running the old IGP shrinks until there are none left. Once
that happens, I can simply remove the OSPFv2 protocol called `ospf4_old`, and keep the two OSPFv3
protocols now intuitively called `ospf4` and `ospf6`. Nice.
This maintenance isn't super intrusive. For IPng's customers, latency goes up from time to time as
backbone links are drained, the link is reconfigured to become unnumbered and OSPFv3, and put back
into service. The whole operation takes a few hours, and I enjoy the repetitive tasks, getting
pretty good at the drain-reconfigure-undrain cycle after a while.
It looks really cool on transit routers, like this one in Lille, France:
```
pim@frggh0:~$ ip -br a | grep UP
loop0 UP 194.1.163.10/32 2001:678:d78::a/128 fe80::dcad:ff:fe00:0/64
xe0-0 UP 193.34.197.143/25 2001:7f8:6d::8298:1/64 fe80::3eec:efff:fe70:24a/64
xe0-1 UP fe80::3eec:efff:fe70:24b/64
xe1-0 UP fe80::6a05:caff:fe32:45ac/64
xe1-1 UP fe80::6a05:caff:fe32:45ad/64
xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.100@xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.200@xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.391@xe1-2 UP 46.20.247.3/29 2a02:2528:ff03::3/64 fe80::6a05:caff:fe32:45ae/64
xe0-1.100@xe0-1 UP 194.1.163.137/29 2001:678:d78:6::1/64 fe80::3eec:efff:fe70:24b/64
pim@frggh0:~$ birdc show bfd ses
BIRD v2.15.1-4-g280daed5-x ready.
bfd1:
IP address Interface State Since Interval Timeout
fe80::3eec:efff:fe46:68a9 xe1-2.200 Up 2024-06-19 20:16:58 0.100 3.000
fe80::6a05:caff:fe32:3e38 xe1-2.100 Up 2024-06-19 20:13:11 0.100 3.000
pim@frggh0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4:
Router ID Pri State DTime Interface Router IP
194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38
194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38
194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9
```
You can see here that the router indeed has an IPv4 loopback address 194.1.163.10/32 and an IPv6
loopback address 2001:678:d78::a/128. It has two backbone links: `xe1-2.100` towards Paris and `xe1-2.200`
towards Amsterdam. Judging by the time between the BFD sessions, it took me somewhere around four
minutes to drain, reconfigure, and undrain each link. I kept on listening to Nora en Pure's
[[Episode #408](https://www.youtube.com/watch?v=AzfCrOEW7e8)] the whole time.
### A traceroute
The beauty of this solution is that the routers will still have one IPv4 and IPv6 address, from
their `loop0` interface. The VPP dataplane will use this when generating ICMP error messages, for
example in a traceroute. It will look quite normal:
```
pim@squanchy:~/src/ipng.ch$ traceroute bit.nl
traceroute to bit.nl (213.136.12.97), 30 hops max, 60 byte packets
1 chbtl0.ipng.ch (194.1.163.65) 0.366 ms 0.408 ms 0.393 ms
2 chrma0.ipng.ch (194.1.163.0) 1.219 ms 1.252 ms 1.180 ms
3 defra0.ipng.ch (194.1.163.7) 6.943 ms 6.887 ms 6.922 ms
4 nlams0.ipng.ch (194.1.163.8) 12.882 ms 12.835 ms 12.910 ms
5 as12859.frys-ix.net (185.1.203.186) 14.028 ms 14.160 ms 14.436 ms
6 http-bit-ev-new.lb.network.bit.nl (213.136.12.97) 14.098 ms 14.671 ms 14.965 ms
pim@squanchy:~$ traceroute6 bit.nl
traceroute6 to bit.nl (2001:7b8:3:5::80:19), 64 hops max, 60 byte packets
1 chbtl0.ipng.ch (2001:678:d78:3::1) 0.871 ms 0.373 ms 0.304 ms
2 chrma0.ipng.ch (2001:678:d78::) 1.418 ms 1.387 ms 1.764 ms
3 defra0.ipng.ch (2001:678:d78::7) 6.974 ms 6.877 ms 6.912 ms
4 nlams0.ipng.ch (2001:678:d78::8) 13.023 ms 13.014 ms 13.013 ms
5 as12859.frys-ix.net (2001:7f8:10f::323b:186) 14.322 ms 14.181 ms 14.827 ms
6 http-bit-ev-new.lb.network.bit.nl (2001:7b8:3:5::80:19) 14.176 ms 14.24 ms 14.093 ms
```
The only difference from before is that these traceroute hops now come from the loopback addresses,
not the P2P transit links (e.g. the second hop, through `chrma0`, is now 194.1.163.0 and 2001:678:d78::
respectively, where before it would have been 194.1.163.17 and 2001:678:d78::2:3:2).
Subtle, but super dope.
### Link Flap Test
The proof is in the pudding, they say. After all of this link draining, reconfiguring and undraining,
I gain confidence that this stuff actually works as advertised! I thought it'd be a nice touch to
demonstrate a link drain, between Frankfurt and Amsterdam. I recorded a little asciinema
[[screencast](/assets/vpp-ospf/rollout.cast)], shown here:
{{< image src="/assets/vpp-ospf/rollout.gif" alt="Asciinema" >}}
### Returning IPv4 (and IPv6!) addresses
Now that the backbone links no longer carry global unicast addresses, and they borrow from the one
IPv4 and IPv6 address in `loop0`, I can return a whole stack of addresses:
{{< image src="/assets/vpp-ospf/roi.png" alt="ROI" >}}
In total, I returned 34 IPv4 addresses from IPng's /24, which is 13.3%. This is huge, and I'm
confident that I will find a better use for these little addresses than pointless
point-to-point links!
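For the record, the 13.3% follows directly from the size of a /24. Trivial arithmetic, nothing assumed beyond the 34 addresses mentioned above:

```python
# Sanity check on the figure above: 34 addresses returned from a /24.
reclaimed = 34
total = 2 ** (32 - 24)  # a /24 holds 256 addresses
print(f"{reclaimed}/{total} = {reclaimed / total:.1%}")
```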
---
date: "2024-06-29T06:31:00Z"
title: 'Case Study: IPng at Coloclue'
---
{{< image width="250px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Coloclue" >}}
I have been a member of the Coloclue association in Amsterdam for a long time. This is a networking
association in the social and technical sense of the word. [[Coloclue](https://coloclue.net)] is
based in Amsterdam with members throughout the Netherlands and Europe. Its goals are to facilitate
learning about and operating IP based networks and services. It has about 225 members who, together,
have built this network and deployed about 135 servers across 8 racks in 3 datacenters (Qupra,
EUNetworks and NIKHEF). Coloclue is operating [[AS8283](https://as8283.peeringdb.net/)] across
several local and international internet exchange points.
A little while ago, one of our members, Sebas, shared their setup with the membership. It generated a
bit of a show-and-tell response, with Sebas and other folks on our mailing list curious as to how we
all deployed our stuff. My buddy Tim pinged me on Telegram: "This is something you should share for
IPng as well!", so this article is a bit different than my usual dabbles. It will be more of a show
and tell: how did I deploy and configure the _Amsterdam Chapter_ of IPng Networks?
I'll make this article a bit more picture-dense, to show the look-and-feel of the equipment.
### Network
{{< image width="350px" float="right" src="/assets/coloclue-ipng/MPLS-Amsterdam.svg" alt="MPLS Ring" >}}
One thing that Coloclue and IPng Networks have in common is that we are networking clubs :) And
readers of my articles may well know that I do so very much like writing about networking. During
the Corona Pandemic, my buddy Fred asked "Hey you have this PI /24, why don't you just announce it
yourself? It'll be fun!" -- and after resisting for a while, I finally decided to go for it. Fred
owns a Swiss ISP called [[IP-Max](https://ip-max.net/)] and he was about to expand into Amsterdam,
and in an epic roadtrip we deployed a point of presence for IPng Networks in each site where IP-Max
has a point of presence.
In Amsterdam, I introduced Fred to Arend Brouwer from [[ERITAP](https://eritap.com/)], and we
deployed our stuff in a brand new rack he had acquired at NIKHEF. It was fun to be in an AirBnB,
drive over to NIKHEF, and together with Arend move in to this completely new and empty rack in one
of the most iconic internet places on the planet. I am deeply grateful for the opportunity.
{{< image width="350px" float="right" src="/assets/coloclue-ipng/staging-ams01.png" alt="Staging" >}}
For IP-Max, this deployment means a Nexus 3068PQ switch and an ASR9001 router, with one 10G
wavelength towards Newtelco in Frankfurt, Germany, and another 10G wavelength towards ETIX in Lille,
France. For IPng it means a Centec S5612X switch, and a Supermicro router. To the right you'll see
the network as it was deployed during that roadtrip - a ring of sites from Zurich, Frankfurt,
Amsterdam, Lille, Paris, Geneva and back to Zurich. They are all identical in terms of hardware.
Pictured to the right is our _staging_ environment, in that AirBnB in Amsterdam: Fred's Nexus and
Cisco ASR9k, two of my Supermicro routers, and an APU which is used as an OOB access point.
### Hardware
Considering Coloclue is a computer and network association, lots of folks are interested in the
physical bits. I'll take some time to detail the hardware that I use for my network, focusing
specifically on the Amsterdam sites.
#### Switches
{{< image width="350px" float="right" src="/assets/coloclue-ipng/centec-stack.png" alt="Centec" >}}
My switches are from a brand called Centec. They make their own switch silicon, are known to be
very power efficient, and have an affordable cost per port. What's important for me is that the
switches offer MPLS, VPLS and L2VPN, as well as VxLAN, GENEVE and GRE functionality, all in hardware.
Pictured on the right you can see the main Centec switch types that I use (in the red boxes above
called _IPng Site Local_):
* Centec S5612X: 8x1G RJ45, 8x1G SFP, 12x10G SFP+, 2x40G QSFP+ and 8x25G SFP28
* Centec S5548X: 48x1G RJ45, 2x40G QSFP+ and 4x25G SFP28
* Centec S5624X: 24x10G SFP+ and 2x100G QSFP28
There are also bigger variants, such as the S7548N-8Z switch (48x25G, 8x100G, which is delicious),
or S7800-32Z (32x100G, which is possibly even more yummy). Overall, I have very good experiences
with these switches and the vendor ecosystem around them (optics, patch cables, WDM muxes, and so
on).
Sandwiched between the switches, you'll see some black Supermicro machines with 6x1G, 2x10G SFP+
and 2x25G SFP28 ports. I'll detail them below; they are IPng's default choice for low-power routers, as
fully loaded they consume about 45W, and can forward 65Gbps and around 35Mpps or so, enough to
kill a horse. And definitely enough for IPng!
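To put those numbers in perspective, here's the standard Ethernet back-of-envelope (my own math, not a figure from the loadtests): a single 10G port at minimum-size frames tops out near 14.88Mpps, so 35Mpps is a bit over two ports' worth of worst-case traffic, while at common 1500-byte frames all six ports together need under 5Mpps.

```python
# Packets per second at Ethernet line rate: each frame occupies its own
# bytes on the wire, plus an 8-byte preamble and a 12-byte inter-frame gap.
def mpps(link_gbps, frame_bytes):
    wire_bits = (frame_bytes + 8 + 12) * 8
    return link_gbps * 1e9 / wire_bits / 1e6

print(f"{mpps(10, 64):.2f} Mpps per 10G port at 64B frames")   # ~14.88
print(f"{mpps(60, 1500):.2f} Mpps for 6x10G at 1500B frames")  # ~4.93
```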
{{< image width="350px" float="right" src="/assets/coloclue-ipng/msw0.nlams0.png" alt="msw0.nlams0" >}}
A photo album with a few pictures of the Centec switches, including their innards, lives
[[here](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)]. In Amsterdam, I have one S5624X, which
connects with 3x10G to IP-Max (one link to Lille, another to Frankfurt, and the third for local
services off of IP-Max's ASR9001). Pictured right is that Centec S5624X MPLS switch at NIKHEF,
called `msw0.nlams0.net.ipng.ch`, which is where my Amsterdam story really begins: _"Goed zat voor
doordeweeks!"_
#### Router
{{< image width="350px" float="right" src="/assets/coloclue-ipng/nlams0-inside.png" alt="nlams0 innards" >}}
The European ring that I built on my roadtrip with Fred consists of identical routers in each
location. I was looking for a machine that has competent out of band operations with IPMI or iDRAC,
could carry 32GB of ECC memory, had at least 4C/8T, and as many 10Gig ports as could realistically
fit. I settled on the Supermicro 5018D-FN8T
[[ref](https://www.supermicro.com/en/products/system/1u/5018/sys-5018d-fn8t.cfm)], because of its
relatively low power CPU (an Intel Xeon D-1518 at 35W TDP), ability to boot off of NVMe or mSATA,
and a PCIe v3.0 x8 expansion slot, which carries an additional Intel X710-DA4 quad-tengig NIC.
I've loadtested these routers extensively while I was working on the Linux Control Plane in VPP, and
I can sustain full port speeds on all six TenGig ports, to a maximum of roughly 35Mpps of IPv4, IPv6
or MPLS traffic. Considering the machine, when fully loaded, will draw about 45 Watts, this is a
very affordable and power efficient router. I love them!
{{< image width="350px" float="right" src="/assets/coloclue-ipng/nlams0.png" alt="nlams0" >}}
The only thing I'd change is to add a second power supply. I personally have never had a PSU fail in
any of the Supermicro routers I operate, but sometimes datacenters do need to take a feed offline,
and that's unfortunate if it causes an interruption. If I'd do it again, I would go for dual PSU,
but I can't complain either, as my router in NIKHEF has been running since 2021 without any power
issues.
I'm a huge fan of Supermicro's IPMI design based on the ASpeed AST2400 BMC; it supports almost all
IPMI features, notably serial-over-LAN, remote port off/cycle, remote KVM over HTML5, remote USB
disk mount, remote install of operating systems, and requires no download of client software. It all
just works in Firefox or Chrome or Safari -- and I've even reinstalled several routers remotely, as
I described in my article [[Debian on IPng's VPP routers]({% post_url 2023-12-17-defra0-debian %})].
There's just something magical about remote-mounting a Debian Bookworm iso image from my workstation
in Br&uuml;ttisellen, Switzerland, in a router running in Amsterdam, to then proceed to use KVM over
HTML5 to reinstall the whole thing remotely. We didn't have that, growing up!!
#### Hypervisors
{{< image width="350px" float="right" src="/assets/coloclue-ipng/hvn0.nlams1.png" alt="hvn0.nlams1" >}}
I host two machines at Coloclue. I started off way back in 2010 or so with one Dell R210-II. That
machine still runs today, albeit in the Telehouse2 datacenter in Paris. At the end of 2022, I made a
trip to Amsterdam to deploy three identical machines, all reasonably spec'd Dell PowerEdge R630
servers:
1. 8x32GB (256GB total) Registered/Buffered (ECC) DDR4
1. 2x Intel Xeon E5-2696 v4 (88 CPU threads in total)
1. 1x LSI SAS3008 SAS12 controller
1. 2x 500G Crucial MX500 (TLC) SSD
1. 3x 3840G Seagate ST3840FM0003 (MLC) SAS12
1. Dell rNDC 2x Intel I350 (RJ45), 2x Intel X520 (10G SFP+)
{{< image width="350px" float="right" src="/assets/coloclue-ipng/R630-S5612X.png" alt="Dell+Centec" >}}
Once you go to enterprise storage, you will never want to go back. I take specific care to buy
redundant boot drives, mostly Crucial MX500 because it's TLC flash, and a bit more reliable. However,
the MLC flash from these Seagate and HPE SAS-3 drives (12Gbps bus speeds) is next level. The Seagate
3.84TB drives are in a RAIDZ1 together, and read/write over them is a sustained 2.6GB/s and roughly
380Kops/sec per drive. It really makes the VMs on the hypervisor fly -- and at the same time has a
much, much, much better durability and lifetime. Before I switched to enterprise storage, I would
physically wear out a Samsung consumer SSD in about 12-15 months, and reads/writes would become
unbearably slow over time. With these MLC based drives: no such problem. _Ga hard!_
All hypervisors run Debian Bookworm and have a dedicated iDRAC enterprise port + license. I find the
Supermicro IPMI a little bit easier to work with, but the basic features are supported on the Dell
as well: serial-over-LAN (which comes in super handy at Coloclue), remote power on/off/cycle, power
metering (using `ipmitool sensor`), and a clunky KVM over HTML5 if need be.
### Coloclue: Routing
Let's dive into the Coloclue deployment, here's an overview picture, with three colors:
<span style='color:blue;font-weight:bold;'>blue</span> for Coloclue's network components,
<span style='color:red;font-weight:bold;'>red</span> for IPng's internal network, and
<span style='color:orange;font-weight:bold;'>orange</span> for IPng's public network services.
{{< image src="/assets/coloclue-ipng/AS8298-Amsterdam.svg" alt="Amsterdam" >}}
{{< image width="350px" float="right" src="/assets/coloclue-ipng/nikhef.png" alt="NIKHEF" >}}
Coloclue currently has three main locations: Qupra, EUNetworks and NIKHEF. I've drawn the Coloclue
network in <span style='color:blue;font-weight:bold;'>blue</span>. It's pretty impressive, with
a 10G wave between each of the locations. Within the two primary colocation sites (Qupra and
EUNetworks), there are two core switches from Arista, which connect to top-of-rack switches from
FS.com in an MLAG configuration. This means that each TOR is connected redundantly to both core
switches with 10G. The switch in NIKHEF connects to a set of local internet exchanges, as well as to
IP-Max, who deliver a bunch of remote IXPs to Coloclue, notably DE-CIX, FranceIX, and SwissIX. It is
in NIKHEF where `nikhef-core-1.switch.nl.coloclue.net` (colored <span
style='color:blue;font-weight:bold;'>blue</span>) connects to my `msw0.nlams0.net.ipng.ch`
switch (in <span style='color:red;font-weight:bold;'>red</span>). IPng Networks' european backbone
then connects from here to Frankfurt and southbound onwards to Zurich, but it also connects from
here to Lille and southbound onwards to Paris.
In the picture to the right you can see ERITAP's rack R181 in NIKHEF, when it was ... younger. It did
not take long for many folks to request cross connects (I myself already have a dozen or so,
and I'm only one customer in this rack!)
One of the advantages of being a launch customer is that I got to see the rack when it was mostly
empty. Here we can see Coloclue's switch at the top (with the white flat ribbon RJ45 being my
interconnect with it). Then there are two PC Engines APUs, which are IP-Max and IPng's OOB serial
machines. Then comes the ASR9001 called `er01.zrh56.ip-max.net` and under it the Nexus switch that
IP-Max uses for its local customers (including Coloclue and IPng!).
My main router is all the way at the bottom of the picture, called `nlams0.ipng.ch`, one of those
Supermicro D-1518 machines. It is connected with 2x10G in a LAG to the MPLS switch, and then to most
of the available internet exchanges in NIKHEF. I also have two transit providers in Amsterdam:
IP-Max (10Gbit) and A2B Internet (10Gbit).
```
pim@nlams0:~$ birdc show route count
BIRD v2.15.1-4-g280daed5-x ready.
10735329 of 10735329 routes for 969311 networks in table master4
2469998 of 2469998 routes for 203707 networks in table master6
1852412 of 1852412 routes for 463103 networks in table t_roa4
438100 of 438100 routes for 109525 networks in table t_roa6
Total: 15495839 of 15495839 routes for 1745646 networks in 4 tables
```
With the RIB at over 15M entries, I would say this site is very well connected!
### Coloclue: Hypervisors
I have one Dell R630 at Qupra (`hvn0.nlams1.net.ipng.ch`), one at EUNetworks (`hvn0.nlams2.net.ipng.ch`),
and a third one with ERITAP at Equinix AM3 (`hvn0.nlams3.net.ipng.ch`). That last one is connected
with a 10Gbit wavelength to IPng's switch `msw0.nlams0.net.ipng.ch`, and another 10Gbit port to
FrysIX.
{{< image width="350px" float="right" src="/assets/coloclue-ipng/qupra.png" alt="Qupra" >}}
Arend and I run a small internet exchange called [[FrysIX](https://ixpmanager.frys-ix.net/)]. I
supply most of the services from that third hypervisor, which has a 10G connection to the local
FrysIX switch at Equinix AM3. More recently, it became possible to request cross connects at Qupra, so
I've put in a request to connect my hypervisor there to FrysIX with 10G as well - this will not be
for peering purposes, but to be able to redundantly connect things like our routeservers,
ixpmanager, librenms, sflow services, and so on. It's nice to be able to have two hypervisors
available, as it makes maintenance just that much easier.
{{< image width="350px" float="right" src="/assets/coloclue-ipng/eunetworks.png" alt="EUNetworks" >}}
Turning my attention to the two hypervisors at Coloclue, one really cool feature that Coloclue
offers, is an L2 VLAN connection from your colo server to the NIKHEF site over our 10G waves between
the datacenters. I requested one of these in each site to NIKHEF using Coloclue's VLAN 402 at Qupra,
and VLAN 412 at EUNetworks. It is over these VLANs that I carry _IPng Site Local_ to the
hypervisors. I showed this in the overview diagram as an <span
style='color:orange;font-weight:bold'>orange</span> dashed line. I bridge Coloclue's VLAN 105 (which
is their eBGP VLAN which has loose uRPF filtering on the Coloclue routers) into a Q-in-Q transport
towards NIKHEF. These two links are colored <span
style='color:purple;font-weight:bold'>purple</span> from EUnetworks and <span
style='color:green;font-weight:bold'>green</span> from Qupra. Finally, I transport my own colocation
VLAN to each site using another Q-in-Q transport with inner VLAN 100.
That may seem overly complicated, so let me describe these one by one:
**1. Colocation Connectivity**:
I will first create a bridge called `coloclue`, which I'll give an MTU of 1500. I will add to that
the port that connects to the Coloclue TOR switch, called `eno4`. However, I will give that port an
MTU of 9216 as I will support jumbo frames on other VLANs later.
```
pim@hvn0-nlams1:~$ sudo ip link add coloclue type bridge
pim@hvn0-nlams1:~$ sudo ip link set coloclue mtu 1500 up
pim@hvn0-nlams1:~$ sudo ip link set eno4 mtu 9216 master coloclue up
pim@hvn0-nlams1:~$ sudo ip addr add 94.142.244.54/24 dev coloclue
pim@hvn0-nlams1:~$ sudo ip addr add 2a02:898::146:1/64 dev coloclue
pim@hvn0-nlams1:~$ sudo ip route add default via 94.142.244.254
pim@hvn0-nlams1:~$ sudo ip route add default via 2a02:898::1
```
**2. IPng Site Local**:
All hypervisors at IPng are connected to a private network called _IPng Site Local_ with IPv4
addresses from `198.19.0.0/16` and IPv6 addresses from `2001:678:d78:500::/56`, both of which are
not routed on the public Internet. I will give the hypervisor an address and a route towards IPng
Site local like so:
```
pim@hvn0-nlams1:~$ sudo ip link add ipng-sl type bridge
pim@hvn0-nlams1:~$ sudo ip link set ipng-sl mtu 9000 up
pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.402 type vlan id 402
pim@hvn0-nlams1:~$ sudo ip link set eno4.402 mtu 9216 master ipng-sl up
pim@hvn0-nlams1:~$ sudo ip addr add 198.19.4.194/27 dev ipng-sl
pim@hvn0-nlams1:~$ sudo ip addr add 2001:678:d78:509::2/64 dev ipng-sl
pim@hvn0-nlams1:~$ sudo ip route add 198.19.0.0/16 via 198.19.4.193
pim@hvn0-nlams1:~$ sudo ip route add 2001:678:d78:500::/56 via 2001:678:d78:509::1
```
Note the MTU here. While the hypervisor is connected via 1500 bytes to the Coloclue network, it is
connected with 9000 bytes to IPng Site Local. On the other side of VLAN 402 lives the Centec switch,
which is configured simply with a VLAN interface:
```
interface vlan402
description Infra: IPng Site Local (Qupra)
mtu 9000
ip address 198.19.4.193/27
ipv6 address 2001:678:d78:509::1/64
!
interface vlan301
description Core: msw0.defra0.net.ipng.ch
mtu 9028
label-switching
ip address 198.19.2.13/31
ipv6 address 2001:678:d78:501::6:2/112
ip ospf network point-to-point
ip ospf cost 73
ipv6 ospf network point-to-point
ipv6 ospf cost 73
ipv6 router ospf area 0
enable-ldp
!
interface vlan303
description Core: msw0.frggh0.net.ipng.ch
mtu 9028
label-switching
ip address 198.19.2.24/31
ipv6 address 2001:678:d78:501::c:1/112
ip ospf network point-to-point
ip ospf cost 85
ipv6 ospf network point-to-point
ipv6 ospf cost 85
ipv6 router ospf area 0
enable-ldp
```
There are two other interfaces here: `vlan301` towards the MPLS switch in Frankfurt Equinix FR5 and
`vlan303` towards the MPLS switch in Lille ETIX#2. I've configured those to enable OSPF, LDP and
MPLS forwarding. As such, `hvn0.nlams1.net.ipng.ch` becomes a leaf node with a /27 and a
/64 in IPng Site Local, in which I can run virtual machines and stuff.
Traceroutes on this private underlay network are very pretty, using the `net.ipng.ch` domain, and
entirely using silicon-based wirespeed routers with IPv4, IPv6 and MPLS and jumbo frames, never
hitting the public Internet:
```
pim@hvn0-nlams1:~$ traceroute6 squanchy.net.ipng.ch 9000
traceroute to squanchy.net.ipng.ch (2001:678:d78:503::4), 30 hops max, 9000 byte packets
1 msw0.nlams0.net.ipng.ch (2001:678:d78:509::1) 1.116 ms 1.720 ms 2.369 ms
2 msw0.defra0.net.ipng.ch (2001:678:d78:501::6:1) 7.804 ms 7.812 ms 7.823 ms
3 msw0.chrma0.net.ipng.ch (2001:678:d78:501::5:1) 12.839 ms 13.498 ms 14.138 ms
4 msw1.chrma0.net.ipng.ch (2001:678:d78:501::11:2) 12.686 ms 13.363 ms 13.951 ms
5 msw0.chbtl0.net.ipng.ch (2001:678:d78:501::1) 13.446 ms 13.523 ms 13.683 ms
6 squanchy.net.ipng.ch (2001:678:d78:503::4) 12.890 ms 12.751 ms 12.767 ms
```
**3. Coloclue BGP uplink**:
I make use of the IP transit offering of Coloclue. Coloclue has four routers in total: two in
EUNetworks and two in Qupra, which I'll show the configuration for here. I don't take the transit
session on the hypervisor, but rather I forward the traffic at Layer 2 to my VPP router called
`nlams0.ipng.ch` over VLAN 402 <span style='color:purple;font-weight:bold'>purple</span> and VLAN
412 <span style='color:green;font-weight:bold'>green</span> VLANs to NIKHEF. I'll show the
configuration for Qupra (VLAN 402) first:
```
pim@hvn0-nlams1:~$ sudo ip link add coloclue-bgp type bridge
pim@hvn0-nlams1:~$ sudo ip link set coloclue-bgp mtu 1500 up
pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.105 type vlan id 105
pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.105 type vlan id 105
pim@hvn0-nlams1:~$ sudo ip link set eno4.105 mtu 1500 master coloclue-bgp up
pim@hvn0-nlams1:~$ sudo ip link set eno4.402.105 mtu 1500 master coloclue-bgp up
```
These VLANs terminate on `msw0.nlams0.net.ipng.ch` where I just offer them directly to the VPP
router:
```
interface eth-0-2
description Infra: nikhef-core-1.switch.nl.coloclue.net e1/34
switchport mode trunk
switchport trunk allowed vlan add 402,412
switchport trunk allowed vlan remove 1
lldp disable
!
interface eth-0-3
description Infra: nlams0.ipng.ch:Gi8/0/0
switchport mode trunk
switchport trunk allowed vlan add 402,412
switchport trunk allowed vlan remove 1
```
**4. IPng Services VLANs**:
I have one more thing to share. Up until now, the hypervisor has internal connectivity to _IPng Site
Local_, and a single IPv4 / IPv6 address in the shared colocation network. Almost all VMs at IPng
run entirely in IPng Site Local, and use reverse proxies and other tricks to expose themselves
to the internet. But I also use a modest amount of IPv4 and IPv6 addresses on the VMs here, for
example for those NGINX reverse proxies [[ref]({% post_url 2023-03-17-ipng-frontends %})], or my
SMTP relays [[ref]({% post_url 2024-05-17-smtp %})].
For this purpose, I will need to plumb through some form of colocation VLAN in each site, which
looks very similar to the BGP uplink VLAN I described previously:
```
pim@hvn0-nlams1:~$ sudo ip link add ipng type bridge
pim@hvn0-nlams1:~$ sudo ip link set ipng mtu 9000 up
pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.100 type vlan id 100
pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.100 type vlan id 100
pim@hvn0-nlams1:~$ sudo ip link set eno4.100 mtu 9000 master ipng up
pim@hvn0-nlams1:~$ sudo ip link set eno4.402.100 mtu 9000 master ipng up
```
Looking at the VPP router, it picks up these two VLANs 402 and 412, which are used for _IPng Site
Local_. On top of those, the router will add two Q-in-Q VLANs: `402.105` will be the BGP uplink, and
Q-in-Q `402.100` will be the IPv4 space assigned to IPng:
```
interfaces:
GigabitEthernet8/0/0:
device-type: dpdk
description: 'Infra: msw0.nlams0.ipng.ch:eth-0-3'
lcp: e0-1
mac: '3c:ec:ef:46:65:97'
mtu: 9216
sub-interfaces:
402:
description: 'Infra: VLAN to Qupra'
lcp: e0-0.402
mtu: 9000
412:
description: 'Infra: VLAN to EUNetworks'
lcp: e0-0.412
mtu: 9000
402100:
description: 'Infra: hvn0.nlams1.ipng.ch'
addresses: ['94.142.241.184/32', '2a02:898:146::1/64']
lcp: e0-0.402.100
mtu: 9000
encapsulation:
dot1q: 402
exact-match: True
inner-dot1q: 100
402105:
description: 'Transit: Coloclue (urpf-shared-vlan Qupra)'
addresses: ['185.52.225.34/28', '2a02:898:0:1::146:1/64']
lcp: e0-0.402.105
mtu: 1500
encapsulation:
dot1q: 402
exact-match: True
inner-dot1q: 105
```
Using BGP, my AS8298 will announce my own prefixes and two /29s that Coloclue has assigned to me.
One of them is `94.142.241.184/29` in Qupra, and the other is `94.142.245.80/29` in
EUNetworks. But I don't like wasting IP space, so I assign only the first /32 from each range to
the interface, and use Bird2 to set a route for the other seven addresses into the interface, which
allows me to use all eight addresses!
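The Bird2 side of that trick can be sketched roughly like this (a hedged sketch, not my literal config: the protocol name is made up, and the `e0-0.402.100` interface name is the LCP from the config below):
```
# Sketch: only 94.142.241.184/32 is configured on the interface, but a device
# route for the whole /29 lets all eight addresses resolve on-link.
protocol static coloclue_qupra_v4 {
  ipv4;
  route 94.142.241.184/29 via "e0-0.402.100";
}
```
With that route exported to the FIB, an address like `94.142.241.189` becomes reachable even though no /29 is configured on the interface itself.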
```
pim@border0-nlams3:~$ traceroute nginx0.nlams1.ipng.ch
traceroute to nginx0.nlams1.ipng.ch (94.142.241.189), 30 hops max, 60 byte packets
1 ipmax.nlams0.ipng.ch (46.20.243.177) 1.190 ms 1.102 ms 1.101 ms
2 speed-ix.coloclue.net (185.1.222.16) 0.448 ms 0.405 ms 0.361 ms
3 nlams0.ipng.ch (185.52.225.34) 0.461 ms 0.461 ms 0.382 ms
4 nginx0.nlams1.ipng.ch (94.142.241.189) 1.084 ms 1.042 ms 1.004 ms
pim@border0-nlams3:~$ traceroute smtp.nlams2.ipng.ch
traceroute to smtp.nlams2.ipng.ch (94.142.245.85), 30 hops max, 60 byte packets
1 ipmax.nlams0.ipng.ch (46.20.243.177) 2.842 ms 2.743 ms 3.264 ms
2 speed-ix.coloclue.net (185.1.222.16) 0.383 ms 0.338 ms 0.338 ms
3 nlams0.ipng.ch (185.52.225.34) 0.372 ms 0.365 ms 0.304 ms
4 smtp.nlams2.ipng.ch (94.142.245.85) 1.042 ms 1.000 ms 0.959 ms
```
### Coloclue: Services
I run a bunch of services on these hypervisors. Some are for me personally, or for my company IPng
Networks GmbH, and some are for community projects. Let me list a few things here:
**AS112 Services** \
I run an anycasted AS112 cluster in all sites where IPng has hypervisor capacity. Notably in
Amsterdam, my nodes are running on both Qupra and EUNetworks, and connect to LSIX, SpeedIX, FogIXP,
FrysIX and behind AS8283 and AS8298. The nodes here handle roughly 5kqps at peak, and if RIPE NCC's
node in Amsterdam goes down, this can go up to 13kqps (right, WEiRD?). I described the setup in an
[[article]({% post_url 2021-06-28-as112 %})]. You may be wondering: how do I get those internet
exchanges backhauled to a VM at Coloclue? The answer is: VxLAN transport! Here's a relevant snippet
from the `nlams0.ipng.ch` router config:
```
vxlan_tunnels:
vxlan_tunnel1:
local: 94.142.241.184
remote: 94.142.241.187
vni: 11201
interfaces:
TenGigabitEthernet4/0/0:
device-type: dpdk
description: 'Infra: msw0.nlams0:eth-0-9'
lcp: xe0-0
mac: '3c:ec:ef:46:68:a8'
mtu: 9216
sub-interfaces:
112:
description: 'Peering: LSIX for AS112'
l2xc: vxlan_tunnel1
mtu: 1522
vxlan_tunnel1:
description: 'Infra: AS112 LSIX'
l2xc: TenGigabitEthernet4/0/0.112
mtu: 1522
```
And the Centec switch config:
```
vlan database
vlan 112 name v-lsix-as112 mac learning disable
interface eth-0-5
description Infra: LSIX AS112
switchport access vlan 112
interface eth-0-9
description Infra: nlams0.ipng.ch:Te4/0/0
switchport mode trunk
switchport trunk allowed vlan add 100,101,110-112,302,311,312,501-503,2604
switchport trunk allowed vlan remove 1
```
What happens is: LSIX connects the AS112 port to the Centec switch on `eth-0-5`, which offers it
tagged to `Te4/0/0.112` on the VPP router, without wasting CAM space for the MAC addresses (by
turning off MAC learning -- this is possible because there are only two ports in the VLAN, so the
switch implicitly always knows where to forward the frames!).
After sending it out on `eth-0-9` tagged as VLAN 112, VPP in turn encapsulates it with VxLAN and sends
it as VNI 11201 to remote endpoint `94.142.241.187`. Because that path has an MTU of 9000, the
traffic arrives at the VM at 1500b, no worries. Most of my AS112 traffic arrives at a VM this
way, as it's really easy to flip the remote endpoint of the VxLAN tunnel to another replica in case
of an outage or maintenance. Typically, BGP sessions won't even notice.
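Flipping the endpoint can be sketched at the VPP CLI roughly like this (a hedged sketch: the standby address `94.142.245.187` and the `vxlan_tunnel0` interface name are made up for illustration, and in practice I drive this from the router's YAML config rather than by hand):
```
# Tear down the tunnel towards the current replica...
vppctl create vxlan tunnel src 94.142.241.184 dst 94.142.241.187 vni 11201 del
# ...re-create it towards the standby replica, and re-attach the cross-connect
# (l2 xconnect must be set in both directions):
vppctl create vxlan tunnel src 94.142.241.184 dst 94.142.245.187 vni 11201
vppctl set interface l2 xconnect TenGigabitEthernet4/0/0.112 vxlan_tunnel0
vppctl set interface l2 xconnect vxlan_tunnel0 TenGigabitEthernet4/0/0.112
```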
**NGINX Frontends** \
At IPng, almost everything runs in the internal network called _IPng Site Local_. I expose this
network via a few carefully placed NGINX frontends. There are two in my own network (in Geneva and
Zurich), one in IP-Max's network (in Zurich), and two at Coloclue (in Amsterdam). They perform
SSL offloading and TCP loadbalancing for a variety of websites and services. I described the
architecture and design in an [[article]({% post_url 2023-03-17-ipng-frontends %})]. There are
currently ~120 websites frontended on this cluster.
**SMTP Relays** \
I self-host my mail, and I've tried to make it fully redundant and self-repairing: SMTP in- and
outbound with Postfix, an IMAP server and redundant maildrop storage with Dovecot, a webmail
service with Roundcube, and so on. Because I need to perform DNSBL lookups, this requires routable IPv4 and IPv6
addresses. Two of my four mailservers run at Coloclue, which I described in an [[article]({%
post_url 2024-05-17-smtp %})].
**Mailman Service** \
For FrysIX, FreeIX, and IPng itself, I run a set of mailing lists. The mailman service runs
partially in IPng Site Local, and has one IPv4 address for outbound e-mail. I separated this from
the IPng relays so that IP based reputation does not interfere between these two types of
mailservice.
**FrysIX Services** \
The routeserver `rs2.frys-ix.net`, the authoritative nameserver `ns2.frys-ix.net`, the IXPManager
and LibreNMS monitoring service all run on hypervisors at either Coloclue (Qupra) or ERITAP
(Equinix AM3). By the way, remember the part about the enterprise storage? The IXPManager is
currently running on `hvn0.nlams3.net.ipng.ch`, which has a set of three Samsung EVO consumer SSDs
that are really at the end of their life. Please, can I connect to FrysIX from Qupra so I can move
these VMs to the Seagate SAS-3 MLC storage pool? :)
{{< image width="100px" float="right" src="/assets/coloclue-ipng/pencilvester.png" alt="Pencilvester" >}}
**IPng OpenBSD Bastion Hosts** \
IPng Networks has three OpenBSD bastion jumphosts with an open SSH port 22, which are named after
characters from a TV show called Rick and Morty. **Squanchy** lives in my house on
`hvn0.chbtl0.net.ipng.ch`, **Glootie** lives at IP-Max on `hvn0.chrma0.net.ipng.ch`, and **Pencilvester**
lives on a hypervisor at Coloclue on `hvn0.nlams1.net.ipng.ch`. These bastion hosts connect both to the public
internet and to the _IPng Site Local_ network. As such, if I have SSH access, I will also have
access to the internal network of IPng.
**IPng Border Gateways** \
The internal network of IPng is mostly disconnected from the Internet. Although I can log in via the
bastion hosts, I also have a set of four so-called _Border Gateways_, which are connected both to
the _IPng Site Local_ network and to the Internet. Each of them runs an IPv4 and IPv6 WireGuard
endpoint, and I'm pretty much always connected to these. They give me full access to the internal
network, NAT'ed towards the Internet.
Each border gateway announces a default route towards the Centec switches, and connects to AS8298,
AS8283 and AS25091 for internet connectivity. One of them runs in Amsterdam, and I wrote about
these gateways in an [[article]({% post_url 2023-03-11-mpls-core %})].
**Public NAT64/DNS64 Gateways** \
I operate a set of four private NAT64/DNS64 gateways, one of which is in Amsterdam. They pair up
with and complement the WireGuard and NAT44/NAT66 functionality of the _Border Gateways_. Because NAT64 is
useful in general, I also operate two public NAT64/DNS64 gateways, one at Qupra and one at
EUNetworks. You can try them for yourself by using the following anycasted resolver:
`2a02:898:146:64::64` and performing a traceroute to an IPv4 only host, like `github.com`. Note:
this works from anywhere, but for safety reasons, I filter some ports like SMTP, NETBIOS and so on,
roughly the same way a TOR exit router would. I wrote about them in an [[article]({% post_url
2024-05-25-nat64-1 %})].
```
pim@cons0-nlams0:~$ cat /etc/resolv.conf
# *** Managed by IPng Ansible ***
#
domain ipng.ch
search net.ipng.ch ipng.ch
nameserver 2a02:898:146:64::64
pim@cons0-nlams0:~$ traceroute6 -q1 ipv4.tlund.se
traceroute to ipv4c.tlund.se (2a02:898:146:64::c10f:e4c3), 30 hops max, 80 byte packets
1 2a10:e300:26:48::1 (2a10:e300:26:48::1) 0.221 ms
2 as8283.ix.frl (2001:7f8:10f::205b:187) 0.443 ms
3 hvn0.nlams1.ipng.ch (2a02:898::146:1) 0.866 ms
4 bond0-100.dc5-1.router.nl.coloclue.net (2a02:898:146:64::5e8e:f4fc) 0.900 ms
5 bond0-130.eunetworks-2.router.nl.coloclue.net (2a02:898:146:64::5e8e:f7f2) 0.920 ms
6 ams13-peer-1.hundredgige2-3-0.tele2.net (2a02:898:146:64::50f9:d18b) 2.302 ms
7 ams13-agg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:e1e) 22.760 ms
8 gbg-cagg-1.bundle-ether7.tele2.net (2a02:898:146:64::5b81:ef8) 22.983 ms
9 bck3-core-1.bundle-ether6.tele2.net (2a02:898:146:64::5b81:c74) 22.295 ms
10 lba5-core-2.bundle-ether2.tele2.net (2a02:898:146:64::5b81:c2f) 21.951 ms
11 avk-core-2.bundle-ether9.tele2.net (2a02:898:146:64::5b81:c24) 21.760 ms
12 avk-cagg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:c0d) 22.602 ms
13 skst123-lgw-2.bundle-ether50.tele2.net (2a02:898:146:64::5b81:e23) 21.553 ms
14 skst123-pe-1.gigabiteth0-2.tele2.net (2a02:898:146:64::82f4:5045) 21.336 ms
15 2a02:898:146:64::c10f:e4c3 (2a02:898:146:64::c10f:e4c3) 21.722 ms
```
### Thanks for reading
{{< image width="150px" float="left" src="/assets/coloclue-ipng/heart.png" alt="Heart" >}}
This article is a bit different to my usual writing - it doesn't deep dive into any protocol or code
that I've written, but it does describe a good chunk of the way I think about systems and
networking. I appreciate the opportunities that Coloclue as a networking community and hobby club
affords. I'm always happy to talk about routing, network- and systems engineering, and the stuff I
develop at IPng Networks, notably our VPP routing stack. I encourage folks to become a member and
learn about and develop novel approaches to this thing we call the Internet.
Oh, and if you're a Coloclue member looking for a secondary location, IPng offers colocation and
hosting services in Zurich, Geneva, and soon in Lucerne as well :) Houdoe!
---
date: "2024-07-05T12:51:23Z"
title: 'Review: R86S (Jasper Lake - N6005)'
---
# Introduction
{{< image width="250px" float="right" src="/assets/r86s/r86s-front.png" alt="R86S Front" >}}
I am always interested in finding new hardware that is capable of running VPP. Of course, a standard
issue 19" rack mountable machine like a Dell, HPE or SuperMicro machine is an obvious choice. They
come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA
or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or
size/cost of these 19" rack mountable machines can be prohibitive. Sometimes, just having a smaller
form factor can be very useful: \
***Enter the GoWin R86S!***
{{< image width="250px" float="right" src="/assets/r86s/r86s-nvme.png" alt="R86S NVME" >}}
I stumbled across this lesser known build from GoWin, which is an ultra compact but modern design,
featuring three 2.5GbE ethernet ports and optionally two 10GbE, or as I'll show here, two 25GbE
ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot
off of an m.2 NVME -- which makes it immediately an appealing device to put in the field. I
noticed that the height of the machine is just a few millimeters under 1U, which is 1.75"
(44.5mm). That gave me the bright idea to 3D print a bracket so I can rack these, and because they
are very compact -- only 78mm wide -- I can manage to fit four of them in one 1U front, or maybe a
Mikrotik CRS305 breakout switch alongside. Slick!
{{< image width="250px" float="right" src="/assets/r86s/r86s-ocp.png" alt="R86S OCP" >}}
I picked up two of these _R86S Pro_ and when they arrived, I noticed that their 10GbE is actually
an _Open Compute Project_ (OCP) footprint expansion card, which struck me as clever. It means that I
can replace the Mellanox `CX342A` network card with perhaps something more modern, such as an Intel
`X520-DA2` or Mellanox `MCX542B_ACAN` which is even dual-25G! So I take to ebay and buy myself a few
expansion OCP boards, which are surprisingly cheap, perhaps because the OCP form factor isn't as
popular as 'normal' PCIe v3.0 cards.
I put a Google photos album online [[here](https://photos.app.goo.gl/gPMAp21FcXFiuNaH7)], in case
you'd like some more detailed shots.
In this article, I'll write about a mixture of hardware, systems engineering (how the hardware like
network cards and motherboard and CPU interact with one another), and VPP performance diagnostics.
I hope that it helps a few wary Internet denizens feel their way around these challenging but
otherwise fascinating technical topics. Ready? Let's go!
# Hardware Specs
{{< image width="250px" float="right" src="/assets/r86s/nics.png" alt="NICs" >}}
For the CHF 314,- I paid for each Intel Pentium N6005, this machine is delightful! They feature:
* Intel Pentium Silver N6005 @ 2.00GHz (4 cores)
* 2x16GB Micron LPDDR4 memory @2933MT/s
* 1x Samsung SSD 980 PRO 1TB NVME
* 3x Intel I226-V 2.5GbE network ports
* 1x OCP v2.0 connector with PCIe v3.0 x4 delivered
* USB-C power supply
* 2x USB3 (one on front, one on side)
* 1x USB2 (on the side)
* 1x MicroSD slot
* 1x MicroHDMI video out
* Wi-Fi 6 AX201 160MHz onboard
To the right I've put the three OCP network interface cards side by side. On the top, the Mellanox
Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at
the bottom, the _classic_ Intel 82599ES (2x10G) card. As I'll demonstrate, despite having the same
form factor, each of these have a unique story to tell, well beyond their rated portspeed.
There's quite a few options for CPU out there - GoWin sells them with Jasper Lake (Celeron N5105
or Pentium N6005, the one I bought), but also with newer Alder Lake (N100 or N305). Price,
performance and power draw will vary. I looked at a few differences in Passmark, and I think I made
a good trade off between cost, power and performance. You may of course choose differently!
The R86S formfactor is very compact, coming in at (80mm x 120mm x 40mm), and the case is made of
sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the
kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service
the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two
antennae at the back are tri-band, suitable for WiFi 6. There is one fan included in the chassis,
with a few cut-outs in the top of the case, to let the air flow through the case. The fan is not
noisy, but definitely noticeable.
## Compiling VPP on R86S
I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other
with the Mellanox Cx5 network cards. While the Mellanox Cx342A that comes with the R86S does have
DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both
ports as unique PCI devices, causing VPP to crash with duplicate graph node names:
```
vlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx'
Failed to save post-mortem API trace to /tmp/api_post_mortem.794
received signal SIGABRT, PC 0x7f9445aa9e2c
```
The way VPP enumerates DPDK devices is by walking the PCI bus, but considering the Connect-X3 has
two ports behind the same PCI address, it'll try to create two interfaces, which fails. It's pretty
easily fixable with a small [[patch](/assets/r86s/vpp-cx3.patch)]. Off I go, to compile VPP (version
`24.10-rc0~88-ge3469369dd`) with Mellanox DPDK support, to get the best side by side comparison
between the Cx3 and X520 cards on the one hand needing DPDK, and the Cx5 card optionally also being
able to use VPP's RDMA driver. They will _all_ be using DPDK in my tests.
I'm not out of the woods yet, because VPP throws an error when enumerating and attaching the
Mellanox Cx342. I read the DPDK documentation for this poll mode driver
[[ref](https://doc.dpdk.org/guides/nics/mlx4.html)] and find that when using DPDK applications, the
`mlx4_core` driver in the kernel has to be initialized with a specific flag, like so:
```
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1"
```
And because I'm using `iommu`, the correct driver to load for Cx3 is `vfio_pci`, so I put that in
`/etc/modules`, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the
way, I am now ready to take the R86S out for a spin and see how much this little machine is capable
of forwarding as a router.
### Power: Idle and Under Load
I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by the OS and
controlplane, leaving 3 worker threads for VPP. The Pentium Silver N6005 comes with 32kB of L1
per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It's not much, but then again
the TDP is a shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the
entire machine idles at 12 Watts. Under full load, the Mellanox Cx3 and Intel
X520-DA2 both sip 17 Watts of power and the Mellanox Cx5 slurps 20 Watts of power all-up. Neat!
## Loadtest Results
{{< image width="400px" float="right" src="/assets/r86s/loadtest.png" alt="Loadtest" >}}
For each network interface I will do a bunch of loadtests, to show different aspects of the setup.
First, I'll do a bunch of unidirectional tests, where traffic goes into one port and exits another.
I will do this with either large packets (1514b), small packets (64b) with many flows, which allow me
to use multiple hardware receive queues assigned to individual worker threads, or small packets with
only one flow, limiting VPP to only one RX queue and consequently only one CPU thread. Because I
think it's hella cool, I will also loadtest MPLS label switching (eg. MPLS frame with label '16' on
ingress, forwarded with a swapped label '17' on egress). In general, MPLS lookups can be a bit
faster as they are (constant time) hashtable lookups, while IPv4 longest prefix match lookups use a
trie. MPLS won't be significantly faster than IPv4 in these tests, because the FIB is tiny with
only a handful of entries.
Second, I'll do the same loadtests but in both directions, which means traffic is both entering NIC0
and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests,
again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip
has to do more work to maintain its RX queues *and* its TX queues simultaneously. As I'll
demonstrate, this tends to matter quite a bit on consumer hardware.
### Intel i226-V (2.5GbE)
This is a 2.5G network interface from the _Foxville_ family, released in Q2 2022 with a ten year
expected availability, it's currently a very good choice. It is a consumer/client chip, which means
I cannot expect super performance from it. In this machine, the three RJ45 ports are connected to
PCI slot 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (this means they are PCIe v2.0) and they take
one x1 PCIe lane to the CPU. I leave the first port as management, and take the second+third one and
give them to VPP like so:
```
dpdk {
dev 0000:02:00.0 { name e0 }
dev 0000:03:00.0 { name e1 }
no-multi-seg
decimal-interface-names
uio-driver vfio-pci
}
```
The logical configuration then becomes:
```
set int state e0 up
set int state e1 up
set int ip address e0 100.64.1.1/30
set int ip address e1 100.64.2.1/30
ip route add 16.0.0.0/24 via 100.64.1.2
ip route add 48.0.0.0/24 via 100.64.2.2
ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70
ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71
mpls table add 0
set interface mpls e0 enable
set interface mpls e1 enable
mpls local-label add 16 eos via 100.64.2.2 e1
mpls local-label add 17 eos via 100.64.1.2 e0
```
In the first block, I'll bring up interfaces `e0` and `e1`, give them an IPv4 address in a /30
transit net, and set a route to the other side. I'll route packets destined to 16.0.0.0/24 to the
Cisco T-Rex loadtester at 100.64.1.2, and I'll route packets for 48.0.0.0/24 to the T-Rex at
100.64.2.2. To avoid the need to ARP for T-Rex, I'll set some static ARP entries to the loadtester's
MAC addresses.
In the second block, I'll enable MPLS, turn it on on the two interfaces, and add two FIB entries. If
VPP receives an MPLS packet with label 16, it'll forward it on to Cisco T-Rex on port `e1`, and if it
receives a packet with label 17, it'll forward it to T-Rex on port `e0`.
Without further ado, here are the results of the i226-V loadtest:
| ***Intel i226-V***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| Unidirectional 1514b | 2.44Gbps | 202kpps | 99.4% |
| Unidirectional 64b Multi | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b Single | 1.58Gbps | 3.28Mpps | 88.1% |
| Unidirectional 64b MPLS | 1.57Gbps | 3.27Mpps | 87.9% |
| Bidirectional 1514b | 4.84Gbps | 404kpps | 99.4% |
| Bidirectional 64b Multi | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b Single | 2.44Gbps | 5.07Mpps | 68.2% |
| Bidirectional 64b MPLS | 2.43Gbps | 5.07Mpps | 68.2% |
First response: very respectable!
#### Important Notes
**1. L1 vs L2** \
There are a few observations I want to make, as these numbers can be confusing. First off, VPP,
when given large packets, can easily sustain almost exactly (!) the line rate of 2.5GbE. There's
always a debate about these numbers, so let me offer some theoretical background --
1. The L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6
bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). It shows us this
number as `Tx bps L2`.
1. But on the wire, the PHY has to additionally send a _preamble_ (7 bytes), a _start frame
delimiter_ (1 byte), and at the end, an _interpacket gap_ (12 bytes), which is 20 bytes of
overhead. This means that the total size on the wire will be **1534 bytes**. It shows us this
number as `Tx bps L1`.
1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we
can send at most 2'500'000'000 / 12272 = **203715 packets per second**. Regardless of L1 or L2,
this number is always `Tx pps`.
1. The smallest (L2) Ethernet frame we're allowed to send, is 64 bytes, and anything shorter than
this is called a _Runt_. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this
means **3.72Mpps** is the theoretical maximum.
When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it
only shows us the `Rx bps`, which is the **L2 bits/sec** corresponding to the sending port's `Tx
bps L2`. When I describe the percentage of Line-Rate, I calculate it against what physically fits on
the wire, i.e. the **L1 bits/sec**, because that makes the most sense to me.
When sending small 64b packets, the difference is significant: taking the above _Unidirectional 64b
Single_ as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M\*64\*8 = 1.679Gbit
of L2 traffic, but a bandwidth of 3.28M\*(64+20)\*8 = 2.204Gbit of L1 traffic, which is how I
determine that it is 88.1% of Line-Rate.
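This arithmetic is easy to capture in a few lines of Python (the helper function is mine, just to illustrate the L1 accounting; the small difference with the 88.1% above comes from rounding the observed packet rate to 3.28Mpps):

```python
def line_rate_pct(pps: float, l2_frame_bytes: int, line_bps: float) -> float:
    """Percentage of L1 line rate used by `pps` frames of `l2_frame_bytes` each."""
    l1_overhead = 7 + 1 + 12  # preamble + start frame delimiter + interpacket gap
    return 100.0 * pps * (l2_frame_bytes + l1_overhead) * 8 / line_bps

# Theoretical maxima on 2.5GbE:
print(int(2.5e9 // ((1514 + 20) * 8)))  # 203715 pps with 1514b frames
print(int(2.5e9 // ((64 + 20) * 8)))    # 3720238 pps (~3.72Mpps) with 64b frames
# The observed 3.28Mpps of 64b frames as a percentage of line rate:
print(round(line_rate_pct(3.28e6, 64, 2.5e9), 1))  # 88.2
```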
**2. One RX queue** \
A less pedantic observation is that there is no difference between _Multi_ and _Single_ flow
loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread.
I did do a few loadtests with multiple receive queues, but it does not matter for performance. When
performing this 3.28Mpps of load, I can see that VPP itself is not saturated: most of the time it's
just sitting there waiting for DPDK to give it work, which manifests as a relatively low
vectors/call:
```
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87
vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
dpdk-input polling 61933 2846586 0 1.28e2 45.96
ethernet-input active 61733 2846586 0 1.71e2 46.11
ip4-input-no-checksum active 61733 2846586 0 6.54e1 46.11
ip4-load-balance active 61733 2846586 0 4.70e1 46.11
ip4-lookup active 61733 2846586 0 7.50e1 46.11
ip4-rewrite active 61733 2846586 0 7.23e1 46.11
e1-output active 61733 2846586 0 2.53e1 46.11
e1-tx active 61733 2846586 0 1.38e2 46.11
```
By the way the other numbers here are fascinating as well. Take a look at them:
* **Calls**: How often has VPP executed this graph node.
* **Vectors**: How many packets (which are internally called vectors) have been handled.
* **Vectors/Call**: Every time VPP executes the graph node, on average how many packets are done
at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00.
* **Clocks**: How many CPU cycles, on average, did each packet spend in each graph node.
Interestingly, summing up this number gets very close to the total CPU clock cycles available
(on this machine 2.4GHz).
Zooming in on the **clocks** number a bit more: every time a packet was handled, roughly 594 CPU
cycles were spent in VPP's directed graph. An additional 128 CPU cycles were spent asking DPDK for
work. Summing it all up, 3.28M\*(594+128) = 2'369'170'800 which is eerily close to the 2.4GHz I
mentioned above. I love it when the math checks out!!
By the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent
in `dpdk-input` (and other polling nodes like `unix-epoll-input`) just go up to consume the whole
core. I explain that in a bit more detail below.
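That back-of-the-envelope check is easy to reproduce in Python (the per-node numbers are read off the Clocks column in the `show runtime` output above):

```python
# Per-packet CPU cycle budget, from the 'show runtime' output above.
graph_cycles = 171 + 65.4 + 47.0 + 75.0 + 72.3 + 25.3 + 138  # ethernet-input .. e1-tx: ~594
dpdk_cycles = 128                                            # dpdk-input polling cost
pps = 3.2814e6                                               # vector rate in

cycles_per_second = pps * (round(graph_cycles) + dpdk_cycles)
print(f"{cycles_per_second:.0f}")  # 2369170800 -- eerily close to a ~2.4GHz core
```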
**3. Uni- vs Bidirectional** \
I noticed a non-linear response between loadtests in one direction versus both directions. At large
packets, it did not matter: both directions saturated the line nearly perfectly (202kpps in one
direction, and 404kpps in both directions). However, with the smaller packets, some contention became
clear. In only one direction, IPv4 and MPLS forwarding were roughly 3.28Mpps; but in both
directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it's
interesting to see how these i226-V chips do seem to care whether they are only receiving or only
transmitting, or performing both receiving *and* transmitting.
### Intel X520 (10GbE)
{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device
(5.0GT/s) and to be able to run two ports, it will need eight lanes of PCI connectivity. However, a
quick inspection using `dmesg` shows me, that there are only 4 lanes brought to the OCP connector:
```
ixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link
at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link)
ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000
ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38
ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
```
That's a bummer! There are two TenGig ports on this OCP card, and since the chip is a PCIe v2.0
device, the PCI encoding is 8b/10b, so each lane delivers only about 80% of its 5.0GT/s: 80% of
the 4x5.0=20GT/s is 16.0Gbit. By the way, when PCIe v3.0 was released, not only did the
transfer speed go to 8.0GT/s per lane, the encoding also changed to 128b/130b which lowers the
overhead from a whopping 20% to only 1.5%. It's not a bad investment of time to read up on PCI
Express standards on [[Wikipedia](https://en.wikipedia.org/wiki/PCI_Express)], as PCIe limitations
and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my
buddy Sander also noted during my NLNOG talk last year.
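The encoding-overhead argument, as a quick Python sketch (the function name is mine, just to illustrate):

```python
def pcie_effective_gbps(gt_per_s: float, lanes: int, enc_payload: int, enc_total: int) -> float:
    """Effective PCIe bandwidth in Gbit/s after line-code overhead."""
    return gt_per_s * lanes * enc_payload / enc_total

# PCIe v2.0 (8b/10b): the x4 link the R86S exposes, and the x8 the NIC would want
print(pcie_effective_gbps(5.0, 4, 8, 10))               # 16.0 -- matches the dmesg message
print(pcie_effective_gbps(5.0, 8, 8, 10))               # 32.0
# PCIe v3.0 (128b/130b): only ~1.5% encoding overhead
print(round(pcie_effective_gbps(8.0, 4, 128, 130), 1))  # 31.5
```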
#### Intel X520: Loadtest Results
Now that I've shown a few of these runtime statistics, I think it's good to review three pertinent graphs.
I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2
adapter. I'll run the same eight loadtests:
**{1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}**
In the table above, I showed the output of `show runtime` in the VPP debug CLI. These numbers are
also exported in a prometheus exporter. I wrote about that in this [[article]({% post_url
2023-04-09-vpp-stats %})]. In Grafana, I can draw these timeseries as graphs, and it shows me a lot
about where VPP is spending its time. Each _node_ in the directed graph counts how many vectors
(packets) it has seen, and how many CPU cycles it has spent doing its work.
{{< image src="/assets/r86s/grafana-vectors.png" alt="Grafana Vectors" >}}
In VPP, a graph of _vectors/sec_ means how many packets per second is the router forwarding. The
graph above is on a logarithmic scale, and I've annotated each of the eight loadtests in orange. The
first block of four are the ***U***_nidirectional_ tests and of course, higher values are better.
I notice that some of these loadtests ramp up until a certain point, after which they become a <span
style='color:orange;font-weight:bold;'>flatline</span>, which I drew orange arrows for. The first
time this clearly happens is in the ***U3*** loadtest. It makes sense to me, because having one flow
implies only one worker thread, whereas in the ***U2*** loadtest the system can make use of multiple
receive queues and therefore multiple worker threads. It stands to reason that ***U2*** has a
slightly better performance than ***U3***.
The fourth test, the _MPLS_ loadtest, forwards the same packets, received with label 16 and sent out
on another interface with label 17. They are therefore also single flow, and this explains why the
***U4*** loadtest looks very similar to the ***U3*** one. Some NICs can hash MPLS traffic to
multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka
82599ES cannot do that.
The second block of four are the ***B***_idirectional_ tests. Similar to the tests I did with the
i226-V 2.5GbE NICs, here each of the network cards has to both receive and send
traffic. It is with this graph that I can determine the overall throughput in packets/sec
of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex
loadtester output JSON. Here they are, for the Intel X520-DA2:
| ***Intel 82599ES***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.77Gbps | 809kpps | 99.2% |
| U2: Unidirectional 64b Multi | 6.48Gbps | 13.4Mpps | 90.1% |
| U3: Unidirectional 64b Single | 3.73Gbps | 7.77Mpps | 52.2% |
| U4: Unidirectional 64b MPLS | 3.32Gbps | 6.91Mpps | 46.4% |
| B1: Bidirectional 1514b | 12.9Gbps | 1.07Mpps | 65.6% |
| B2: Bidirectional 64b Multi | 6.08Gbps | 12.7Mpps | 42.7% |
| B3: Bidirectional 64b Single | 6.25Gbps | 13.0Mpps | 43.7% |
| B4: Bidirectional 64b MPLS | 3.26Gbps | 6.79Mpps | 22.8% |
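The _% of Line-Rate_ column can be cross-checked from the packet rate alone: on Ethernet, every frame occupies its L2 length plus 20 bytes of L1 overhead (preamble, start-of-frame delimiter, and inter-frame gap). A quick sketch of that arithmetic, as my own helper rather than anything from T-Rex or VPP:

```python
def line_rate_pps(link_bps: float, l2_frame_bytes: int) -> float:
    """Theoretical max packets/sec for one direction: each frame occupies
    its L2 length plus 20 bytes of L1 overhead (preamble + SFD + IFG)."""
    return link_bps / ((l2_frame_bytes + 20) * 8)

def pct_of_line(pps: float, link_bps: float, l2_frame_bytes: int) -> float:
    return 100.0 * pps / line_rate_pps(link_bps, l2_frame_bytes)

# 10GbE with 64 byte frames tops out at ~14.88 Mpps per direction:
print(round(line_rate_pps(10e9, 64) / 1e6, 2))  # 14.88
# U2 above measured 13.4 Mpps, which is ~90% of that:
print(round(pct_of_line(13.4e6, 10e9, 64)))     # 90
```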
A few further observations:
1. ***U1***'s loadtest shows that the machine can sustain 10Gbps in one direction, while ***B1***
shows that bidirectional loadtests are not yielding twice as much throughput. This is very
likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the
OCP NIC supports PCIe 5.0GT/s x8 (32Gbps).
1. ***U3***'s loadtest shows that one single CPU can do 7.77Mpps max, if it's the only CPU that is
doing work. This is likely because if it's the only thread doing work, it gets to use the
entire L2/L3 cache for itself.
1. ***U2***'s test shows that when multiple workers perform work, the throughput raises to
13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is
because the threads now need to share the CPU's modest L2/L3 cache.
1. ***B3***'s loadtest shows that two CPU threads together can do 6.50Mpps each (for a total of
13.0Mpps), which I think is likely because each NIC now has to receive _and_ transit packets.
If you're reading this and think you have an alternative explanation, do let me know!
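As a side note on the PCIe arithmetic in the first observation: a lane's usable throughput is its transfer rate times the line-code efficiency, which for PCIe gen2 (5.0 GT/s) is 8b/10b. A tiny sketch of that calculation (my own, just to show where the 16Gbps ceiling comes from):

```python
def pcie_gbps(gt_per_s: float, lanes: int, enc_num: int, enc_den: int) -> float:
    """Usable PCIe throughput in Gbit/s: transfer rate x lanes x encoding efficiency."""
    return gt_per_s * lanes * enc_num / enc_den

# PCIe gen2 x4 with 8b/10b encoding: the 16Gbps ceiling seen in the B1 loadtest
print(pcie_gbps(5.0, 4, 8, 10))  # 16.0
# The x8 link the OCP NIC would have liked to have
print(pcie_gbps(5.0, 8, 8, 10))  # 32.0
```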
### Mellanox Cx3 (10GbE)
When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA)
for a _list_ of packets, rather than one individual packet. It then brings these lists, called
_vectors_, through a directed acyclic graph inside of VPP. Each graph node does something specific to
the packets, for example in `ethernet-input`, the node checks what ethernet type each packet is (ARP,
IPv4, IPv6, MPLS, ...), and hands them off to the correct next node, such as `ip4-input` or `mpls-input`.
If VPP is idle, there may be only one or two packets in the list, which means every time the packets
go into a new node, a new chunk of code has to be loaded from working memory into the CPU's
instruction cache. Conversely, if there are many packets in the list, only the first packet may need
to pull things into the i-cache, the second through Nth packet will become cache hits and execute
_much_ faster. Moreover, some nodes in VPP make use of processor optimizations like _SIMD_ (single
instruction, multiple data), to save on clock cycles if the same operation needs to be executed
multiple times.
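This amortization effect can be illustrated with a toy cost model (the constants here are invented for illustration, not measured from VPP): a node pays a fixed cost to warm up its instruction cache once per vector, plus a small per-packet cost.

```python
def cycles_per_packet(vector_size: int, warmup: int = 180, per_pkt: int = 35) -> float:
    """Toy model: a fixed i-cache warm-up cost paid once per vector is
    amortized over all packets in it, on top of a constant per-packet cost."""
    return warmup / vector_size + per_pkt

# Nearly idle, 2 packets per vector: warm-up dominates the per-packet cost
print(round(cycles_per_packet(2)))    # 125
# Saturated, 256 packets per vector: warm-up all but disappears
print(round(cycles_per_packet(256)))  # 36
```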
{{< image src="/assets/r86s/grafana-clocks.png" alt="Grafana Clocks" >}}
This graph shows the average CPU cycles per packet for each node. In the first three loadtests
(***U1***, ***U2*** and ***U3***), you can see four lines representing the VPP nodes `ip4-input`
`ip4-lookup`, `ip4-load-balance` and `ip4-rewrite`. In the fourth loadtest ***U4***, you can see
only three nodes: `mpls-input`, `mpls-lookup`, and `ip4-mpls-label-disposition-pipe` (where the MPLS
label '16' is swapped for outgoing label '17').
It's clear to me that when VPP does not have many packets/sec to route (ie the ***U1*** loadtest), the
cost _per packet_ is actually quite high, at around 200 CPU cycles per packet per node. But, if I
slam the VPP instance with lots of packets/sec (ie the ***U3*** loadtest), VPP gets _much_ more
efficient at what it does. What used to take 200+ cycles per packet now only takes between 34-52
cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?!
And with that, the Mellanox Cx3 loadtest completes, and the results are in:
| ***Mellanox MCX342A-XCCN***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 9.73Gbps | 805kpps | 99.7% |
| U2: Unidirectional 64b Multi | 1.11Gbps | 2.30Mpps | 15.5% |
| U3: Unidirectional 64b Single | 1.10Gbps | 2.27Mpps | 15.3% |
| U4: Unidirectional 64b MPLS | 1.10Gbps | 2.27Mpps | 15.3% |
| B1: Bidirectional 1514b | 18.7Gbps | 1.53Mpps | 94.9% |
| B2: Bidirectional 64b Multi | 1.54Gbps | 2.29Mpps | 7.69% |
| B3: Bidirectional 64b Single | 1.54Gbps | 2.29Mpps | 7.69% |
| B4: Bidirectional 64b MPLS | 1.54Gbps | 2.29Mpps | 7.69% |
Here's something that I find strange though. VPP is clearly not saturated by these 64b loadtests. I
know this because, in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a
bidirectional test, yet with this Mellanox Cx3 card, no matter if I do one direction or both
directions, the max packets/sec tops out at only 2.3Mpps -- that's an order of magnitude lower.
Looking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from
Port 5/0/1), are not very busy at all. If a VPP worker thread is saturated, this typically shows as
a vectors/call of 256.00 and 100% of CPU cycles consumed. But here, that's not the case at all, and
most time is spent in DPDK waiting for traffic:
```
Thread 1 vpp_wk_0 (lcore 1)
Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15
vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
FiftySixGigabitEthernet5/0/1-o active 15949560 35929200 0 8.39e1 2.25
FiftySixGigabitEthernet5/0/1-t active 15949560 35929200 0 2.59e2 2.25
dpdk-input polling 36250611 35929200 0 6.55e2 .99
ethernet-input active 15949560 35929200 0 2.69e2 2.25
ip4-input-no-checksum active 15949560 35929200 0 1.01e2 2.25
ip4-load-balance active 15949560 35929200 0 7.64e1 2.25
ip4-lookup active 15949560 35929200 0 9.26e1 2.25
ip4-rewrite active 15949560 35929200 0 9.28e1 2.25
unix-epoll-input polling 35367 0 0 1.29e3 0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38
vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
FiftySixGigabitEthernet5/0/0-o active 14845221 35913927 0 8.66e1 2.42
FiftySixGigabitEthernet5/0/0-t active 14845221 35913927 0 2.72e2 2.42
dpdk-input polling 23114538 35913927 0 6.99e2 1.55
ethernet-input active 14845221 35913927 0 2.65e2 2.42
ip4-input-no-checksum active 14845221 35913927 0 9.73e1 2.42
ip4-load-balance active 14845220 35913923 0 7.17e1 2.42
ip4-lookup active 14845221 35913927 0 9.03e1 2.42
ip4-rewrite active 14845221 35913927 0 8.97e1 2.42
unix-epoll-input polling 22551 0 0 1.37e3 0.00
```
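A quick sanity check on the output above: the Vectors/Call column is simply Vectors divided by Calls, and a saturated worker would show values approaching 256.00:

```python
# ethernet-input on thread 1, from the runtime output above
calls, vectors = 15_949_560, 35_929_200
print(round(vectors / calls, 2))  # 2.25 -- nowhere near the 256.00 of a saturated worker
```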
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
I kind of wonder why that is. Is the Mellanox ConnectX-3 such a poor performer? Or does it not like
small packets? I've read online that Mellanox cards do some form of message compression on the PCI
bus -- perhaps something to toggle. I don't know, but I don't like it!
### Mellanox Cx5 (25GbE)
VPP has a few _polling_ nodes, which are pieces of code that execute back-to-back in a tight
execution loop. A classic example of a _polling_ node is a _Poll Mode Driver_ from DPDK: this will
ask the network cards if they have any packets, and if so: marshall them through the directed graph
of VPP. As soon as that's done, the node will immediately ask again. If there is no work to do, this
turns into a tight loop with DPDK continuously asking for work. There is however another, lesser
known, _polling_ node: `unix-epoll-input`. This node services a local pool of file descriptors, like
the _Linux Control Plane_ netlink socket for example, or the clients attached to the Statistics
segment, CLI or API. You can see the open files with `show unix files`.
This design explains why the CPU load of a typical DPDK application is 100% on each worker thread. As
an aside, you can ask the PMD to start off in _interrupt_ mode, and only after a certain load switch
seamlessly to _polling_ mode. Take a look at `set interface rx-mode` on how to change from _polling_
to _interrupt_ or _adaptive_ modes. For performance reasons, I always leave the node in _polling_
mode (the default in VPP).
{{< image src="/assets/r86s/grafana-cpu.png" alt="Grafana CPU" >}}
The stats segment shows how many clock cycles are being spent in each call of each node. It also
knows how often nodes are called. Considering the `unix-epoll-input` and `dpdk-input` nodes will
perform what is essentially a tight-loop, the CPU should always add up to 100%. I found that one
cool way to show how busy a VPP instance really is, is to look over all CPU threads, and sort
through the fraction of time spent in each node:
* ***Input Nodes***: are those which handle the receive path from DPDK and into the directed graph
for routing -- for example `ethernet-input`, then `ip4-input` through to `ip4-lookup` and finally
`ip4-rewrite`. This is where VPP usually spends most of its CPU cycles.
* ***Output Nodes***: are those which handle the transmit path into DPDK. You'll see these are
nodes whose name ends in `-output` or `-tx`. You can also see that in ***U2***, there are only
two nodes consuming CPU, while in ***B2*** there are four nodes (because two interfaces are
transmitting!)
* ***epoll***: the _polling_ node called `unix-epoll-input` depicted in <span
style='color:brown;font-weight:bold'>brown</span> in this graph.
* ***dpdk***: the _polling_ node called `dpdk-input` depicted in <span
style='color:green;font-weight:bold'>green</span> in this graph.
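A sketch of that sorting, using some of the thread-1 numbers from the Cx3 runtime output earlier (this is my own post-processing of the stats, not a VPP feature; a node's share of CPU time is approximated here as vectors times cycles-per-packet, normalized over all nodes):

```python
# (vectors, avg cycles/packet) per node, taken from the earlier runtime table
nodes = {
    "dpdk-input":     (35_929_200, 655.0),
    "ethernet-input": (35_929_200, 269.0),
    "ip4-lookup":     (35_929_200,  92.6),
    "ip4-rewrite":    (35_929_200,  92.8),
}
total = sum(v * c for v, c in nodes.values())
# Sort nodes by their share of total CPU cycles, busiest first
for name, (v, c) in sorted(nodes.items(), key=lambda kv: -kv[1][0] * kv[1][1]):
    print(f"{name:16s} {100 * v * c / total:5.1f}%")
```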
If there is no work to do, as was the case at around 20:30 in the graph above, the _dpdk_ and
_epoll_ nodes are the only two that are consuming CPU. If there's lots of work to do, as was the
case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest
between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving
the _dpdk_ and _epoll_ nodes until an equilibrium is achieved. This is how I know the VPP process
is the bottleneck and not, for example, the PCI bus.
I let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table
for the Mellanox Cx5:
| ***Mellanox MCX542_ACAT***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
|---------------|----------|-------------|-----------|
| U1: Unidirectional 1514b | 24.2Gbps | 2.01Mpps | 98.6% |
| U2: Unidirectional 64b Multi | 7.43Gbps | 15.5Mpps | 41.6% |
| U3: Unidirectional 64b Single | 3.52Gbps | 7.34Mpps | 19.7% |
| U4: Unidirectional 64b MPLS | 7.34Gbps | 15.3Mpps | 41.1% |
| B1: Bidirectional 1514b | 24.9Gbps | 2.06Mpps | 50.4% |
| B2: Bidirectional 64b Multi | 6.58Gbps | 13.7Mpps | 18.4% |
| B3: Bidirectional 64b Single | 3.15Gbps | 6.55Mpps | 8.81% |
| B4: Bidirectional 64b MPLS | 6.55Gbps | 13.6Mpps | 18.3% |
Some observations:
1. This Mellanox Cx5 runs quite a bit hotter than the other two cards. It's a PCIe v3.0 card, which means
that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case
you're wondering, this is 128b/130b encoding on 8.0GT/s x4).
1. It easily saturates 25Gbit in one direction with big packets in ***U1***, but as soon as smaller
packets are offered, each worker thread tops out at 7.34Mpps or so in ***U2***.
1. When testing in both directions, each thread can do about 6.55Mpps or so in ***B2***. Similar to
the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple
threads), and RX/TX simultaneously (when doing bidirectional tests).
1. MPLS is a lot faster -- nearly double, thanks to the use of multiple threads. I think this is
because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner
payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not.
## Summary and closing thoughts
There's a lot to say about these OCP cards. The Intel is cheap, the Mellanox Cx3 is a bit quirky
with its VPP enumeration, and the Mellanox Cx5 is a bit more expensive (and draws a fair bit more
power, coming in at 20W) but does 25Gbit reasonably well, so it's pretty difficult to make a solid
recommendation. What I find interesting is the very low limit in packets/sec on 64b packets coming
from the Cx3, while at the same time there seems to be an added benefit in MPLS hashing that the
other two cards do not have.
All things considered, I think I would recommend the Intel x520-DA2 (based on the _Niantic_ chip,
Intel 82599ES, total machine coming in at 17W). It seems like it pairs best with the available CPU
on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)
Here's a few files I gathered along the way, in case they are useful:
* [[LSCPU](/assets/r86s/lscpu.txt)] - [[Likwid Topology](/assets/r86s/likwid-topology.txt)] -
[[DMI Decode](/assets/r86s/dmidecode.txt)] - [[LSBLK](/assets/r86s/lsblk.txt)]
* Mellanox Cx341: [[dmesg](/assets/r86s/dmesg-cx3.txt)] - [[LSPCI](/assets/r86s/lspci-cx3.txt)] -
[[LSHW](/assets/r86s/lshw-cx3.txt)] - [[VPP Patch](/assets/r86s/vpp-cx3.patch)]
* Mellanox Cx542: [[dmesg](/assets/r86s/dmesg-cx5.txt)] - [[LSPCI](/assets/r86s/lspci-cx5.txt)] -
[[LSHW](/assets/r86s/lshw-cx5.txt)]
* Intel X520-DA2: [[dmesg](/assets/r86s/dmesg-x520.txt)] - [[LSPCI](/assets/r86s/lspci-x520.txt)] -
[[LSHW](/assets/r86s/lshw-x520.txt)]
* VPP Configs: [[startup.conf](/assets/r86s/vpp/startup.conf)] - [[L2 Config](/assets/r86s/vpp/config/l2.vpp)] -
[[L3 Config](/assets/r86s/vpp/config/l3.vpp)] - [[MPLS Config](/assets/r86s/vpp/config/mpls.vpp)]
---
date: "2024-08-03T10:51:23Z"
title: 'Review: Gowin 1U 2x25G (Alder Lake - N305)'
---
# Introduction
{{< image float="right" src="/assets/gowin-n305/gowin-logo.png" alt="Gowin logo" >}}
Last month, I took a good look at the Gowin R86S based on Jasper Lake (N6005) CPU
[[ref](https://www.gowinfanless.com/products/network-device/r86s-firewall-router/gw-r86s-u-series)],
which is a really neat little 10G (and, if you fiddle with it a little bit, 25G!) router that runs
off of USB-C power and can be rack mounted if you print a bracket. Check out my findings in this
[[article]({% post_url 2024-07-05-r86s %})].
David from Gowin reached out and asked me if I was willing to also take a look at their Alder Lake
(N305) based machine, which comes in a 19" rack mountable chassis, running off of 110V/220V AC mains power,
but also with 2x25G ConnectX-4 network card. Why not! For critical readers: David sent me this
machine, but made no attempt to influence this article.
### Hardware Specs
{{< image width="500px" float="right" src="/assets/gowin-n305/case.jpg" alt="Gowin overview" >}}
There are a few differences between this 19" model and the compact mini-pc R86S. The most obvious
difference is the form factor. The R86S is super compact, not inherently rack mountable,
although I 3D printed a bracket for it. Looking inside, the motherboard is mostly obscured by a large
cooling block with fins that are flush with the top plate. There are 5 copper ports in the front: 2x
Intel i226-V (these are 2.5Gbit) and 3x Intel i210 (these are 1Gbit), and one of them offers PoE,
which can be very handy to power a camera or wifi access point. A nice touch.
The Gowin server comes with an OCP v2.0 port, just like the R86S does. There's a custom bracket with
a ribbon cable to the motherboard, and in the bracket is housed a Mellanox ConnectX-4 LX 2x25Gbit
network card.
### A look inside
{{< image width="350px" float="right" src="/assets/gowin-n305/mobo.jpg" alt="Gowin mobo" >}}
The machine comes with an Intel i3-N305 (Alder Lake) CPU running at a max clock of 3GHz and 4x8GB of
LPDDR5 memory at 4800MT/s -- and considering the Alder Lake can make use of 4-channel memory, this
thing should be plenty fast. The memory is soldered to the board, though, so there's no option of
expanding or changing the memory after buying the unit.
Using `likwid-topology`, I determine that the 8-core CPU has no hyperthreads, but just straight up 8
cores with 32kB of L1 cache, two times 2MB of L2 cache (one bank shared between cores 0-3, and another bank
shared between cores 4-7), and 6MB of L3 cache shared between all 8 cores. This is again a step up
from the Jasper Lake CPU, and should make VPP run a little bit faster.
What I find a nice touch is that Gowin has shipped this board with a 128GB MMC flash disk, which
appears in Linux as `/dev/mmcblk0` and can be used to install an OS. However, there are also two
NVMe (M.2 2280) slots, one mSATA slot and two additional SATA slots with 4-pin power. On the
side of the chassis is a clever bracket that holds three 2.5" SSDs in a staircase configuration.
That's quite a lot of storage options, and given the CPU has some oomph, this little one could
realistically be a NAS, although I'd prefer it to be a VPP router!
{{< image width="350px" float="right" src="/assets/gowin-n305/ocp-ssd.jpg" alt="Gowin ocp-ssd" >}}
The copper RJ45 ports are all on the motherboard, and there's an OCP breakout port that fits any OCP
v2.0 network card. Gowin shipped it with a ConnectX-4 LX, but since I had a ConnectX-5 EN, I will
take a look at performance with both cards. One critical observation, as with the Jasper Lake R86S,
is that there are only 4 PCIe v3.0 lanes routed to the OCP, which means that the spiffy x8 network
interfaces (both the Cx4 and the Cx5 I have here) will run at half speed. Bummer!
The power supply is a 100-240V switching PSU with about 150W of power available. When running idle,
with one 1TB NVMe drive, I measure 38.2W on the 220V side. When running VPP at full load, I measure
47.5W of total load. That's totally respectable for a 2x 25G + 2x 2.5G + 3x 1G VPP router.
I've added some pictures to a [[Google Photos](https://photos.app.goo.gl/rbd9xJBUUcnCgW7v9)] album,
if you'd like to take a look.
## VPP Loadtest: RDMA versus DPDK
You (hopefully!) came here to read about VPP stuff. For years now, I have been curious about the
performance and functional differences in VPP between using DPDK and the native RDMA driver
that Mellanox network cards support. In this article, I'll do four loadtests, with the
stock Mellanox Cx4 that comes with the Gowin server, and with the Mellanox Cx5 card that I had
bought for the R86S. I'll take a look at the differences between DPDK on the one hand and RDMA on the
other. This will yield, for me at least, a better understanding of the two. Spoiler: there
are not many!
## DPDK
{{< image float="right" src="/assets/gowin-n305/dpdk.png" alt="DPDK logo" >}}
The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux
Foundation. It provides a set of data plane libraries and network interface controller polling-mode
drivers for offloading ethernet packet processing from the operating system kernel to processes
running in user space. This offloading achieves higher computing efficiency and higher packet
throughput than is possible using the interrupt-driven processing provided in the kernel.
You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Data_Plane_Development_Kit)]
or on the [[DPDK Homepage](https://dpdk.org/)]. VPP uses DPDK as one of the (more popular) drivers
for network card interaction.
### DPDK: ConnectX-4 Lx
This is the OCP network card that came with the Gowin server. It identifies in Linux as:
```
0e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0e:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
```
Albeit with an important warning in `dmesg`, about the lack of PCIe lanes:
```
[3.704174] pci 0000:0e:00.0: [15b3:1015] type 00 class 0x020000
[3.708154] pci 0000:0e:00.0: reg 0x10: [mem 0x60e2000000-0x60e3ffffff 64bit pref]
[3.716221] pci 0000:0e:00.0: reg 0x30: [mem 0x80d00000-0x80dfffff pref]
[3.724079] pci 0000:0e:00.0: Max Payload Size set to 256 (was 128, max 512)
[3.732678] pci 0000:0e:00.0: PME# supported from D3cold
[3.736296] pci 0000:0e:00.0: reg 0x1a4: [mem 0x60e4800000-0x60e48fffff 64bit pref]
[3.756916] pci 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link
at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
```
With PCIe v3.0's 128b/130b encoding overhead, that means the card will have (128/130) * 32 = 31.508 Gbps
available, and I'm actually not quite sure why the kernel claims 31.504G in the log message. Anyway,
the card itself works just fine at this speed, and is immediately detected in DPDK while continuing
to use the `mlx5_core` driver. This would be a bit different with Intel based cards, as there the
driver has to be rebound to `vfio_pci` or `uio_pci_generic`. Here, the NIC itself remains visible
(and usable!) in Linux, which is kind of neat.
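The kernel's bandwidth complaint can be reproduced from the link parameters, as a quick back-of-envelope check:

```python
raw = 8.0 * 4             # 8.0 GT/s x 4 lanes = 32 Gbit/s on the wire
usable = raw * 128 / 130  # minus the 128b/130b line-code overhead
print(round(usable, 3))   # 31.508 -- slightly above the kernel's reported 31.504
```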
I do my standard set of eight loadtests: {unidirectional,bidirectional} x {1514b, 64b multiflow, 64b
singleflow, MPLS}. This teaches me a lot about how the NIC uses flow hashing, and what its maximum
performance is. Without further ado, here's the results:
| Loadtest: Gowin CX4 DPDK | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional | 25.00 Gbps | 2.04 Mpps | 100.2 % |
| 64b-unidirectional | 7.43 Gbps | 11.05 Mpps | 29.7 % |
| 64b-single-unidirectional | 3.09 Gbps | 4.59 Mpps | 12.4 % |
| 64b-mpls-unidirectional | 7.34 Gbps | 10.93 Mpps | 29.4 % |
| 1514b-bidirectional | 22.63 Gbps | 1.84 Mpps | 45.2 % |
| 64b-bidirectional | 7.42 Gbps | 11.04 Mpps | 14.8 % |
| 64b-single-bidirectional | 5.33 Gbps | 7.93 Mpps | 10.7 % |
| 64b-mpls-bidirectional | 7.36 Gbps | 10.96 Mpps | 14.8 % |
Some observations:
* In the large packet department, the NIC easily saturates the port speed in _unidirectional_, and
saturates the PCI bus (x4) in _bidirectional_ forwarding. I'm surprised that the bidirectional
forwarding capacity is a bit lower (1.84Mpps versus 2.04Mpps).
* The NIC is using three queues, and the scaling from a single flow (which can only use one
queue, and thus one CPU thread) to three queues is not exactly linear (4.59Mpps vs 11.05Mpps)
* The MPLS performance is higher than single flow, which I think means that the NIC is capable of
hashing the packets based on the _inner packet_. Otherwise, while using the same MPLS label, the
Cx3 and other NICs tend to just leverage only one receive queue.
I'm very curious how this NIC stacks up between DPDK and RDMA -- read on below for my results!
### DPDK: ConnectX-5 EN
I swap the card out of its OCP bay and replace it with a ConnectX-5 EN that I have from when I
tested the [[R86S]({% post_url 2024-07-05-r86s %})]. It identifies as:
```
0e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
```
And similar to the ConnectX-4, this card also complains about PCIe bandwidth:
```
[6.478898] mlx5_core 0000:0e:00.0: firmware version: 16.25.4062
[6.485393] mlx5_core 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link
at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[6.816156] mlx5_core 0000:0e:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[6.841005] mlx5_core 0000:0e:00.0: Port module event: module 0, Cable plugged
[7.023602] mlx5_core 0000:0e:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[7.177744] mlx5_core 0000:0e:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
```
With that said, the loadtests are quite a bit more favorable for the newer ConnectX-5:
| Loadtest: Gowin CX5 DPDK | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional | 24.98 Gbps | 2.04 Mpps | 99.7 % |
| 64b-unidirectional | 10.71 Gbps | 15.93 Mpps | 42.8 % |
| 64b-single-unidirectional | 4.44 Gbps | 6.61 Mpps | 17.8 % |
| 64b-mpls-unidirectional | 10.36 Gbps | 15.42 Mpps | 41.5 % |
| 1514b-bidirectional | 24.70 Gbps | 2.01 Mpps | 49.4 % |
| 64b-bidirectional | 14.58 Gbps | 21.69 Mpps | 29.1 % |
| 64b-single-bidirectional | 8.38 Gbps | 12.47 Mpps | 16.8 % |
| 64b-mpls-bidirectional | 14.50 Gbps | 21.58 Mpps | 29.1 % |
Some observations:
* The NIC also saturates 25G in one direction with large packets, and saturates the PCI bus when
pushing in both directions.
* Single queue / thread operation at 6.61Mpps is a fair bit higher than Cx4 (which is 4.59Mpps)
* Multiple threads scale almost linearly, from 6.61Mpps in 1Q to 15.93Mpps in 3Q. That's
respectable!
* Bidirectional small packet performance is pretty great at 21.69Mpps, more than double that of
the Cx4 (which is 11.04Mpps).
* MPLS rocks! The NIC forwards 21.58Mpps of MPLS traffic.
One thing I should note, is that at this point, the CPUs are not fully saturated. Looking at
Prometheus/Grafana for this set of loadtests:
{{< image src="/assets/gowin-n305/cx5-cpu.png" alt="Cx5 CPU" >}}
What I find interesting is that in no cases did any CPU thread run to 100% utilization. In the 64b
single flow loadtests (from 14:00-14:10 and from 15:05-15:15), the CPU threads definitely got close,
but they did not clip -- which does lead me to believe that the NIC (or the PCIe bus!) are the
bottleneck.
By the way, the bidirectional single flow 64b loadtest shows two threads that have an overall
slightly _lower_ utilization (63%) versus the unidirectional single flow 64 loadtest (at 78.5%). I
think this can be explained by the two threads being able to use/re-use each others' cache lines.
***Conclusion***: ConnectX-5 performs significantly better than ConnectX-4 with DPDK.
## RDMA
{{< image float="right" src="/assets/gowin-n305/rdma.png" alt="RDMA artwork" >}}
RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire
directly to application memory or from application memory directly to the wire, eliminating the need
to copy data between application memory and the data buffers in the operating system. Such transfers
require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel
with other system operations. This reduces latency in message transfer.
You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]
VPP uses RDMA in a clever way, relying on the Linux library for rdma-core (libibverb) to create a
custom userspace poll-mode driver, specifically for Ethernet packets. Despite using the RDMA APIs,
this is not about RDMA (no Infiniband, no RoCE, no iWARP), just pure traditional Ethernet packets.
Many VPP developers recommend and prefer RDMA for Mellanox devices. I myself have been more
comfortable with DPDK. But, now is the time to _FAFO_.
### RDMA: ConnectX-4 Lx
Considering I used three RX queues for DPDK, I instruct VPP now to use 3 receive queues for RDMA as
well. I remove the `dpdk_plugin.so` from `startup.conf`, although I could also have kept the DPDK
plugin running (to drive the 1.0G and 2.5G ports!) and de-selected the `0000:0e:00.0` and
`0000:0e:00.1` PCI entries, so that the RDMA driver can grab them.
The VPP startup now looks like this:
```
vpp# create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 512 tx-queue-size 512
num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026
vpp# create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 512 tx-queue-size 512
num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026
vpp# set int mac address xxv0 02:fe:4a:ce:c2:fc
vpp# set int mac address xxv1 02:fe:4e:f5:82:e7
```
I realize something pretty cool - the RDMA interface gets an ephemeral (randomly generated) MAC
address, while the main network card in Linux stays available. The NIC internally has a hardware
filter for the RDMA bound MAC address and gives it to VPP -- the implication is that the 25G NICs
can *also* be used in Linux itself. That's slick.
Performance wise:
| Loadtest: Gowin CX4 with RDMA | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional | 25.01 Gbps | 2.04 Mpps | 100.3 % |
| 64b-unidirectional | 12.32 Gbps | 18.34 Mpps | 49.1 % |
| 64b-single-unidirectional | 6.21 Gbps | 9.24 Mpps | 24.8 % |
| 64b-mpls-unidirectional | 11.95 Gbps | 17.78 Mpps | 47.8 % |
| 1514b-bidirectional | 26.24 Gbps | 2.14 Mpps | 52.5 % |
| 64b-bidirectional | 14.94 Gbps | 22.23 Mpps | 29.9 % |
| 64b-single-bidirectional | 11.53 Gbps | 17.16 Mpps | 23.1 % |
| 64b-mpls-bidirectional | 14.99 Gbps | 22.30 Mpps | 30.0 % |
Some thoughts:
* The RDMA driver is significantly _faster_ than DPDK in this configuration. Hah!
* 1514b are fine in both directions, RDMA slightly outperforms DPDK in the bidirectional test.
* 64b is massively faster:
* Unidirectional multiflow: RDMA 18.34Mpps, DPDK 11.05Mpps
* Bidirectional multiflow: RDMA 22.23Mpps, DPDK 11.04Mpps
* Bidirectional MPLS: RDMA 22.30Mpps, DPDK 10.93Mpps.
***Conclusion***: I would say, roughly speaking, that RDMA outperforms DPDK on the Cx4 by a factor
of two. That's really cool, especially because ConnectX-4 network cards are found very cheap these
days.
### RDMA: ConnectX-5 EN
Well then, what about the newer Mellanox ConnectX-5 card? Something surprising happens when I boot
the machine and start the exact same configuration as with the Cx4: the loadtest results almost
invariably suck:
| Loadtest: Gowin CX5 with RDMA | L1 bits/sec | Packets/sec | % of Line |
|----------------------------------|-------------|-------------|-----------|
| 1514b-unidirectional | 24.95 Gbps | 2.03 Mpps | 99.6 % |
| 64b-unidirectional | 6.19 Gbps | 9.22 Mpps | 24.8 % |
| 64b-single-unidirectional | 3.27 Gbps | 4.87 Mpps | 13.1 % |
| 64b-mpls-unidirectional | 6.18 Gbps | 9.20 Mpps | 24.7 % |
| 1514b-bidirectional | 24.59 Gbps | 2.00 Mpps | 49.2 % |
| 64b-bidirectional | 8.84 Gbps | 13.15 Mpps | 17.7 % |
| 64b-single-bidirectional | 5.57 Gbps | 8.29 Mpps | 11.1 % |
| 64b-mpls-bidirectional | 8.77 Gbps | 13.05 Mpps | 17.5 % |
Yikes! The Cx5 in its default mode can still saturate the 1514b loadtests, but turns in dismal numbers
for almost all other loadtest types. I'm surprised also that the single flow loadtest clocks in at only
4.87Mpps; that's about the same speed I saw with the ConnectX-4 using DPDK. This does not look good
at all, and honestly, I don't believe it.
So I start fiddling with settings.
#### ConnectX-5 EN: Tuning Parameters
There are a few things I found that might speed up processing in the ConnectX network card:
1. Allowing for larger PCI packets - by default 512B, I can raise this to 1kB, 2kB or even 4kB.
`setpci -s 0e:00.0 68.w` will return some hex number ABCD, where the A stands for max read size:
0=128B, 1=256B, 2=512B, 3=1kB, 4=2kB, 8=4kB. I can set the value by writing `setpci -s 0e:00.0
68.w=3BCD`, which immediately speeds up the loadtests!
1. Mellanox recommends turning on CQE compression, which allows the PCI messages to be aggressively
compressed, saving bandwidth. This helps specifically with _smaller_ packets, as the PCI message
overhead really starts to matter. `mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1` and reboot.
1. For MPLS, the Cx5 can do flow matching on the inner packet (rather than hashing all packets to
the same queue based on the MPLS label) -- `mlxconfig -d 0e:00.0 set
FLEX_PARSER_PROFILE_ENABLE=1` and reboot.
1. Likely the number of receive queues matters, and can be set in the `create interface rdma`
command.
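To make the nibble encoding in point 1 concrete, here's a tiny sketch (the helper name and example
register values are mine, not from `setpci` itself), assuming the standard PCIe encoding of 128
bytes shifted left by the field value:

```
# decode the top nibble of the Device Control word, e.g. "2936" -> 512 bytes
mrrs_bytes() {
  local nibble=$(( 0x$1 >> 12 ))
  echo $(( 128 << nibble ))
}
mrrs_bytes 2936    # the default: 512b reads
mrrs_bytes 3936    # after raising the top nibble to 3: 1k reads
```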
I notice that CQE_COMPRESSION and FLEX_PARSER_PROFILE_ENABLE help in all cases, so I set them and
reboot. Resizing the PCI reads also helps specifically with smaller packets, so I set that too, in
`/etc/rc.local`. That leaves the fourth variable: the receive queue count.
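Gathering the settings above in one place, here's a sketch of the `/etc/rc.local` fragment I'd use,
assuming the NIC sits at PCI address `0e:00.0` as in the examples. The `mlxconfig` settings persist
in NIC flash and only need to be applied once (followed by a reboot); the `setpci` change does not
survive a reboot, hence rc.local:

```
# one-time, persisted in NIC flash (reboot afterwards):
#   mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1
#   mlxconfig -d 0e:00.0 set FLEX_PARSER_PROFILE_ENABLE=1

# every boot: raise max PCI read size to 1k, keeping the lower bits
cur=$(setpci -s 0e:00.0 68.w)        # e.g. "2936"
setpci -s 0e:00.0 68.w=3${cur#?}     # -> "3936"
```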
Here's a comparison that, to me at least, was surprising. With three receive queues, thus three
CPU threads each receiving 4.7Mpps and sending 3.1Mpps, performance looked like this:
```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026
$ vppctl show run | grep vector\ rates | grep -v in\ 0
vector rates in 4.7586e6, out 3.2259e6, drop 3.7335e2, punt 0.0000e0
vector rates in 4.9881e6, out 3.2206e6, drop 3.8344e2, punt 0.0000e0
vector rates in 5.0136e6, out 3.2169e6, drop 3.7335e2, punt 0.0000e0
```
This is fishy - why is the inbound rate so much higher than the outbound rate? The behavior is
consistent across multi-queue setups. If I create 2 queues, it's 8.45Mpps in and 7.98Mpps out:
```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026
$ vppctl show run | grep vector\ rates | grep -v in\ 0
vector rates in 8.4533e6, out 7.9804e6, drop 0.0000e0, punt 0.0000e0
vector rates in 8.4517e6, out 7.9798e6, drop 0.0000e0, punt 0.0000e0
```
And when I create only one queue, the same pattern appears:
```
$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026
$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 \
    num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026
$ vppctl show run | grep vector\ rates | grep -v in\ 0
vector rates in 1.2082e7, out 9.3865e6, drop 0.0000e0, punt 0.0000e0
```
But now that I've scaled down to only one queue (and thus one CPU thread doing all the work), I
manage to find a clue in the `show runtime` command:
```
Thread 1 vpp_wk_0 (lcore 1)
Time 321.1, 10 sec internal node vector rate 256.00 loops/sec 46813.09
vector rates in 1.2392e7, out 9.4015e6, drop 0.0000e0, punt 1.5571e-2
Name State Calls Vectors Suspends Clocks Vectors/Call
ethernet-input active 15543357 3979099392 0 2.79e1 256.00
ip4-input-no-checksum active 15543352 3979098112 0 1.26e1 256.00
ip4-load-balance active 15543357 3979099387 0 9.17e0 255.99
ip4-lookup active 15543357 3979099387 0 1.43e1 255.99
ip4-rewrite active 15543357 3979099387 0 1.69e1 255.99
rdma-input polling 15543357 3979099392 0 2.57e1 256.00
xxv1-output active 15543357 3979099387 0 5.03e0 255.99
xxv1-tx active 15543357 3018807035 0 4.35e1 194.22
```
It takes a bit of practice to spot this, but see how `xxv1-output` is running at 256 vectors/call,
while `xxv1-tx` is running at only 194.22 vectors/call? That means VPP is dutifully handling the
whole packet, but when it is handed off to RDMA to marshal onto the hardware, it's getting
lost! And indeed, this is corroborated by `show errors`:
```
$ vppctl show err
Count Node Reason Severity
3334 null-node blackholed packets error
7421 ip4-arp ARP requests throttled info
3 ip4-arp ARP requests sent info
1454511616 xxv1-tx no free tx slots error
16 null-node blackholed packets error
```
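A quick back-of-envelope check ties the two outputs together: the gap between the `xxv1-output` and
`xxv1-tx` vector counts in `show runtime` should match the 194.22/256 vectors/call ratio, and indeed
it does:

```
out=3979099387; tx=3018807035        # Vectors column from 'show runtime'
drop_pct=$(awk -v o=$out -v t=$tx 'BEGIN { printf "%.1f", 100 * (o - t) / o }')
ratio_pct=$(awk 'BEGIN { printf "%.1f", 100 * (1 - 194.22 / 256) }')
echo "dropped at tx: ${drop_pct}%  (vectors/call gap: ${ratio_pct}%)"
```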
Wow, well over a billion packets have been routed by VPP, only to be discarded because RDMA output
could not keep up. Ouch.
Compare the previous CPU utilization graph (from the Cx5/DPDK loadtest) with this Cx5/RDMA/1-RXQ
loadtest:
{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
Here I can clearly see that the one CPU thread (in yellow, for the unidirectional test) and the two
CPU threads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
when VPP is completely pegged, it is receiving 12.4Mpps _per core_, but only manages to get RDMA to
send 9.40Mpps of those on the wire. The performance deteriorates further when multiple receive
queues are in play. Note: 12.4Mpps is pretty great for these CPU threads.
***Conclusion***: A single queue RDMA based Cx5 will allow for about 9Mpps per interface, which is
slightly better than DPDK performance; overall, Cx4 and Cx5 performance are not far apart.
## Summary and closing thoughts
The RDMA results for both Cx4 and Cx5, when using only one thread, show fair performance
at very low CPU cost per port -- however, I could not manage to get rid of the `no free tx slots`
errors: VPP can consume / process / forward more packets than RDMA is willing to marshal out on
the wire, which is disappointing.
That said, both RDMA and DPDK achieve line rate at 25G unidirectional with sufficiently large
packets, and for small packets, the machine can realistically handle roughly 9Mpps per CPU thread.
Considering the CPU has 8 threads -- of which 6 are usable by VPP -- the machine has more CPU than it
needs to drive the NICs. It should be a really great router at 10Gbps traffic rates, and a very fair
router at 25Gbps, with either RDMA or DPDK.
Here are a few files I gathered along the way, in case they are useful:
* [[LSCPU](/assets/gowin-n305/lscpu.txt)] - [[Likwid Topology](/assets/gowin-n305/likwid-topology.txt)] -
[[DMI Decode](/assets/gowin-n305/dmidecode.txt)] - [[LSBLK](/assets/gowin-n305/lsblk.txt)]
* Mellanox MCX4421A-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx4.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx4.txt)] -
[[LSHW](/assets/gowin-n305/lshw-cx4.txt)]
* Mellanox MCX542B-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx5.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx5.txt)] -
[[LSHW](/assets/gowin-n305/lshw-cx5.txt)]
* VPP Configs: [[startup.conf](/assets/gowin-n305/vpp/startup.conf)] - [[L2 Config](/assets/gowin-n305/vpp/config/l2.vpp)] -
[[L3 Config](/assets/gowin-n305/vpp/config/l3.vpp)] - [[MPLS Config](/assets/gowin-n305/vpp/config/mpls.vpp)]